ABSTRACT


Certain methods of realizing numeric functions, such as sin(x) or √x, in
hardware involve a Taylor series expansion or the CORDIC algorithm. These methods,
while precise, are iterative and slow and may take on the order of hundreds to thousands
of CPU clock cycles.

A faster method involves a piecewise approximation to the function. The
function value is computed by reading pre-calculated coefficients (slope and intercept for
first-order approximations), multiplying the function argument by the proper slope, and
adding the proper intercept. This produces a close approximation to the function
value.

This thesis shows how this first order approximation technique was implemented 
on an FPGA-based COTS reconfigurable computer. MATLAB routines were developed 
to approximate the function as a set of consecutive, linear equations. The MATLAB 
approximation is combined with other modules designed in VHDL to construct an overall 
circuit. 

A pipelined circuit was created on the SRC-6E computer that reduces the latency
of the sin(πx) function by over 88% and produces a result on each clock cycle. The
circuit easily implements other functions by simply exchanging the approximation and
coefficients. Thus, a user-friendly environment was created for calculating functions at
higher speeds than the more popular current methods.
EXECUTIVE SUMMARY


The need for high-speed numeric computation is greater now than it has ever
been. Applications such as digital signal processing, graphics rendering and scientific
calculations require the evaluation of numeric functions (e.g. sin(x) or log(x)) to be done
quickly and repeatedly. Current methods of producing these solutions may involve a
Taylor series expansion or the CORDIC algorithm. These methods, while precise, are
iterative and slow and may take on the order of hundreds to thousands of clock cycles.


This thesis shows that the evaluation of numeric functions can be done much
more quickly by using a unique architecture. This architecture realizes f(x) as a piecewise
approximation, f(x) ≈ c1x + c0. The function solution is realized by storing pre-calculated
coefficients (slope, c1, and intercept, c0) and then accessing these coefficients from
memory when needed. By multiplying the function argument, x, by the proper slope, c1,
and adding the proper intercept, c0, a close approximation to the function solution can be
produced. Additionally, this process can go beyond just simple functions to more
complicated calculations, as needed.

It is shown how this first-order approximation architecture was implemented on
an FPGA-based COTS reconfigurable computer. A pipelined circuit was created that
reduces the latency of the sin(πx) function by over 88% and produces a result on each
clock cycle. The circuit easily implements other functions by simply exchanging the
approximation and coefficients. Thus, a user-friendly environment was created for
calculating functions at higher speeds than the more popular current methods.

The process includes several steps using three software packages and two pieces 
of hardware. Initially, the desired function approximation is generated on a PC using 
MATLAB. MATLAB determines the piecewise linear approximation, and generates the 
VHDL code to be used by the circuit for coefficient look-up. The overall general circuit 
was initially built using Schematic Capture on the Xilinx ISE software package. The 
Xilinx and MATLAB generated VHDL code is transferred to the SRC-6E reconfigurable 


computer where it is combined with SRC-specific C code to create a macro that is 




capable of performing the function calculations. From this point, different functions can 


be implemented by simply replacing the coefficient look-up VHDL code. 




I. INTRODUCTION 


A. CENTRAL PROBLEM AND PURPOSE 


The need for high-speed numeric computation is greater now than it has ever 
been. Applications such as digital signal processing, graphics rendering and scientific 
calculations require the evaluation of numeric functions (e.g. sin(x) or log(x)) to be done 
quickly and repeatedly. Certain methods of computing these functions involve a Taylor 
Series expansion or the CORDIC algorithm [1]. These methods, while precise, are 


iterative and slow and may take on the order of hundreds to thousands of clock cycles. 


Sasao, Butler and Riedel [2] have shown that numeric functions can be produced
much more quickly by using a unique architecture. This architecture realizes f(x) as a
piecewise approximation, f(x) ≈ c1x + c0. The function solution is realized by storing
pre-calculated coefficients (slope, c1, and intercept, c0) and then accessing these coefficients
from memory when needed. By multiplying the function argument, x, by the proper
slope, c1, and adding the proper intercept, c0, a close approximation to the function
solution can be produced.


Ref. [2] also discusses a method of using a look-up table (LUT) cascade to
determine the segment to which a particular input value, x, corresponds. This is known as
segment indexing. Nagayama, Sasao and Butler [3] show a recursive segmentation
algorithm for dividing a function over its range and an alternate LUT cascade method
based upon the edge-valued binary decision diagram (EVBDD).


Frenzen, Sasao and Butler [4] discuss the relationship between the amount of
memory needed and the desired error constraints for three approximation methods. Ref. [5]
shows how to design a LUT tree circuit for segment indexing, without the need for the
designer to refer to a binary decision diagram (BDD). The characteristics of numeric
function generators (NFGs) based upon second-order approximations are analyzed in
Ref. [6]. Cao, Wei and Cheng [7] discuss three different hardware algorithms that use
second-order approximation. Their method uses floating-point numbers, vice fixed-point
numbers.


The purpose of this thesis is to put theory into practice by demonstrating how the 
first order approximation architecture can be produced with a user-friendly interface and 
implemented on an FPGA-based COTS reconfigurable computer. It will then be shown 
that this architecture produces results with less latency and a shorter average time (for 


long blocks of calculations) than current methods. 


B. IMPLEMENTATION OVERVIEW 


The implementation process includes several steps using three software packages 
and two pieces of hardware. Initially, the desired function approximation is generated on 
a PC using MATLAB [8]. MATLAB generates the VHDL code to be used by the circuit 
for coefficient look-up. The general circuit is built using Schematic Capture on the 
Xilinx Integrated Software Environment (ISE) software package. The subsequent Xilinx 
and MATLAB generated VHDL code is transferred to the SRC-6E reconfigurable 
computer where it is combined with SRC-specific C code to create a macro that is 
capable of performing the function calculations. The SRC synthesizer will physically 
configure the circuit on the FPGA. From this point, different functions can be 
implemented by simply replacing the coefficient look-up VHDL code and re-synthesizing 


the circuit. 

In this thesis, all approximations are first order, and so the segments take the
form f(x) = c1x + c0, where c1 is the slope of the line and c0 is the intercept of the line.
The architecture of the numeric function generator (NFG) is shown in Figure 1. The
independent variable, x, is shown at the top.

Figure 1. Numeric Function Generator (NFG) Architecture (After Ref. 1).


Since the function is typically approximated by many adjoining straight-line
segments, it is necessary to determine which segment is being used for the given value of
x. This is done by the Segment Index Encoder, which produces a segment number
associated with the value of x. After determining the appropriate segment, the
coefficients, c1 and c0, are looked up in memory. These coefficient values are then used
by the rest of the circuit to compute an approximation to f(x). The resultant value is the
solution of the function.
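
The data flow just described can be sketched in software. The following is a minimal Python illustration, not the thesis circuit: it assumes the decimal coefficients printed in the MATLAB output of Figure 3 for sin(πx), and the names SEGMENTS, segment_index and nfg are hypothetical. A binary search stands in for the Segment Index Encoder, and floating-point arithmetic stands in for the fixed-point hardware.

```python
import math
from bisect import bisect_left

# (upper endpoint, slope c1, intercept c0) for sin(pi*x) on [0, 0.5),
# copied from the decimal columns of the MATLAB output in Figure 3.
SEGMENTS = [
    (0.121224, 3.07373, 0.00105),
    (0.200940, 2.74354, 0.04085),
    (0.269054, 2.32099, 0.12564),
    (0.331366, 1.84314, 0.25413),
    (0.390378, 1.32869, 0.42456),
    (0.447590, 0.79036, 0.63469),
    (0.499900, 0.25817, 0.87262),
]
ENDPOINTS = [end for end, _, _ in SEGMENTS]


def segment_index(x):
    """Software stand-in for the Segment Index Encoder.

    Returns the number of the first segment whose upper endpoint is >= x;
    inputs beyond the last endpoint are clamped to the last segment.
    """
    return min(bisect_left(ENDPOINTS, x), len(SEGMENTS) - 1)


def nfg(x):
    """Look up c1 and c0 for the segment containing x, then form c1*x + c0."""
    _, c1, c0 = SEGMENTS[segment_index(x)]
    return c1 * x + c0


# Largest deviation from the true function on a grid of 1000 points.
worst_error = max(
    abs(nfg(k / 2000) - math.sin(math.pi * k / 2000)) for k in range(1000)
)
```

With these rounded table values, worst_error stays close to the 2^-9 error target used for the approximation; it is not expected to match the hardware result exactly, since no fixed-point truncation is modeled here.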

Function f(x)                      Domain          Range
sin(πx)                            [0, 1/2)        [0, 1)
cos(πx)                            [0, 1/2)        (0, 1]
√(-ln x)                           [1/256, 1/4)    (√(-ln(1/4)), √(-ln(1/256))]
-(x log2 x + (1-x) log2(1-x))      (0, 1)
1/(1 + e^(-x))
(1/√(2π)) e^(-x²/2)                [0, √2]

Table 1. Functions of Interest and Their Domains.


As identified in Ref. [2], the specific functions of interest and the associated
domains of their input values are listed in Table 1 above.


The rest of this thesis will discuss the process that is followed to implement the 


NFG on the FPGA. In addition, results will be discussed and compared. 


C. THESIS ORGANIZATION 


Chapter II discusses the approximation of the function and the calculation of the 
coefficients by MATLAB. Chapter III discusses the coding of the NFG in VHDL. 
Chapter IV explains the construction of the NFG circuit on the SRC’s FPGA. Chapter V 
discusses implementation results. Chapter VI provides a brief summary and suggestions 


for future work. 


II. FUNCTION APPROXIMATION


The key to the whole NFG algorithm is the function approximation. The NFG
uses the approximation to calculate the value of the function at any given point (in
actuality, the output of the NFG is the value of the approximation, not that of the
actual function, but it is within error limits). The approximation must be pre-calculated by
dividing the function into multiple, adjacent linear segments. The segment
characteristics (e.g. endpoints, slope and intercept) are then used by the NFG for its
calculations. Therefore, the NFG's accuracy depends upon the accuracy of the
approximation. The error can be made as small as desired by reducing the segment size.
The tradeoff is memory size; as segment size decreases, more segments and more
memory are needed.


A. SEGMENTATION 


The approximation is generated with the aid of user-specified parameters by 


MATLAB (see Appendix A for the MATLAB M-files). 


The M-file LinAppxPfit.m (short-hand notation for Linear Approximation using
MATLAB's routine "Polyfit") is the "master" routine that calls the subordinate
subroutines (the other M-files in Appendix A) as needed to perform its approximation.
Routine LinAppxPfit.m shows the user a set of options that allows MATLAB to perform
the approximation to fit the user's needs. The question and answer format is designed to
make the process as user-friendly as possible; little knowledge of MATLAB is required.
The MATLAB interface looks like this:


LINEAR APPROXIMATION OF A FUNCTION USING POLYFIT with INTERCEPT SHIFTING
[DEFAULT in BRACKETS]

Input the Function, func[sin(pi*x)]: sin(pi*x)
Input the Range of x - LOW value, x(low) [0]: 0
Input the Range of x - HIGH value, x(high) [0.5]: 0.5
(1)Non-uniform or (2) Uniform Segmentation or (3)Both [1]: 1
Input the Desired Error, epsilon[2^-9]: 2^-9
Input the no. of pts the fct is to be evaluated (per unit), N[10000]: 1000
Input the equation to use: (1) F(x)=mx+b or (2) F(x)=m(x-p)+b, [1]: 1

Figure 2. Linear Approximation Function User-Interface.


First, the user is asked:
1. Input the Function, func[sin(pi*x)]:


The user then designates which function to approximate and implement. The
default values for all the prompts are in the brackets []. In this case, the default value is
sin(πx), which has been the 'test' function throughout this project because of its
simplicity and relevance. If the user simply presses 'Enter' at this point, the default value
is used.

Next MATLAB prints out: 


2. Input the Range of x - LOW value, x(low) [0]: 


The user then specifies the lower-end of the range of the independent variable of 


the function. Zero is the default value. Next, MATLAB types: 
3. Input the Range of x - HIGH value, x(high) [0.5]: 


The user then specifies the upper-end of the range of the independent variable of 


the function. One-half is the default value. Next, MATLAB types: 
4. (1)Non-uniform or (2) Uniform Segmentation or (3)Both [1]:

The user then specifies whether the approximation will be conducted where all the 
segments may have different lengths (non-uniform), the same length (uniform) or both. 
Non-uniform is the default. Further discussion on the advantages and disadvantages of 
non-uniform and uniform will follow. Next, MATLAB types: 
5. Input the Desired Error, epsilon[2^-9]:


The user then specifies the maximum acceptable error (the absolute value of the
difference between the approximation and the actual value of the function over all
points). The default value is 2^-9 for 8-bit accuracy. Next, MATLAB types:
6. Input the no. of pts the fct is to be evaluated (per unit), N[10000]: 


Since MATLAB performs calculations discretely (at fixed points), this option 
allows the user to determine the desired resolution when performing these calculations. 
If the function has radical changes throughout its range, a higher resolution may be 
desired. Of course, the higher the resolution, the more calculations there are, and thus 


more time is required to perform the approximation. Next MATLAB types: 
7. Input the equation to use: (1)F(x)=mx+b or (2)F(x)=m(x-p)+b, [1]:


This feature is not used, but would allow the user to have the output produced 
with an extra coefficient to be used on a different architecture. Default is to use the 


standard equation of a line: y=mx+b. At some point in the future, this option may be 


eliminated and the second option chosen automatically. 


If uniform segmentation (2) is chosen in question 4, then the user is prompted 
with: 

8. Would you like to constrain (1)Number of Segments or (2)Error [1]: 

This allows the user to determine how an approximation using uniform
segmentation is conducted. The default is to constrain the number of segments, usually a
power of 2. If the user chooses to constrain error, he/she will be prompted with question
5 above. If (1) Number of Segments is chosen, then the following appears:
9. Input the number of Desired Segments[16]:


This prompts the user to designate the number of desired segments for a uniform 


approximation. The default is 16, which is easily represented by a 4-bit number. 


Upon completion of the user inputs, MATLAB computes the approximation and
returns to the user the following:

1. Segment endpoints in both decimal and binary fixed-point
   representation for each segment
2. Segment slope and intercept in both decimal and binary fixed-point
   representation for each segment
3. A graphical representation of the function with the approximation
   overlaid
4. A graphical representation of the error throughout the range of the
   approximation, with maximum error highlighted
5. VHDL code that describes the segment index and coefficient look-up
   to be used by the SRC (behavioral code)



NON-UNIFORM Segmentation

Segment  End Point  End Point          c1         c1                 c0         c0
Number   (Decimal)  (Binary)           (Decimal)  (Binary)           (Decimal)  (Binary)
0        0.121224   0000000.000111110  3.07373    0000011.000100101  0.00105    0000000.000000000
1        0.200940   0000000.001100110  2.74354    0000010.101111100  0.04085    0000000.000010100
2        0.269054   0000000.010001001  2.32099    0000010.010100100  0.12564    0000000.001000000
3        0.331366   0000000.010101001  1.84314    0000001.110101111  0.25413    0000000.010000010
4        0.390378   0000000.011000111  1.32869    0000001.010101000  0.42456    0000000.011011001
5        0.447590   0000000.011100101  0.79036    0000000.110010100  0.63469    0000000.101000100
6        0.499900   0000000.011111111  0.25817    0000000.010000100  0.87262    0000000.110111110

Figure 3. MATLAB Segmentation Output. 


Figure 3 shows a typical MATLAB output after conducting the approximation
and segmentation of the user-specified function using the LinAppxPfit routine. In this
case, our test function sin(πx) is defined for 0 ≤ x < 0.5, using non-uniform
segmentation with a maximum error of 2^-9. The routine produced a 7-segment
approximation. All fixed-point binary numbers are in a signed, two's-complement, 7.9
format (7 bits to the left and 9 bits to the right of the implicit binary point). This format
allows a number in the range -64 ≤ x < 64 to be represented.
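
As a cross-check of the 7.9 format, the conversion can be sketched in a few lines of Python. This is an illustrative sketch only; the helper names to_fixed, to_real and to_bits are hypothetical and are not the thesis M-files. Truncation (rather than rounding) is an assumption, chosen because it reproduces the binary strings printed in Figure 3.

```python
import math

FRAC_BITS = 9    # bits to the right of the binary point
TOTAL_BITS = 16  # 7 bits (including sign) to the left, 9 to the right


def to_fixed(value):
    """Quantize a real value to a signed 7.9 integer code by truncation."""
    code = math.floor(value * (1 << FRAC_BITS))
    lowest = -(1 << (TOTAL_BITS - 1))
    highest = (1 << (TOTAL_BITS - 1)) - 1
    if not lowest <= code <= highest:
        raise ValueError("value does not fit in 7.9 format")
    return code


def to_real(code):
    """Convert a 7.9 integer code back to the real value it represents."""
    return code / (1 << FRAC_BITS)


def to_bits(code):
    """Format a 7.9 code as a two's-complement string with a binary point."""
    raw = format(code & 0xFFFF, "016b")
    return raw[:7] + "." + raw[7:]


slope_bits = to_bits(to_fixed(3.07373))      # slope of segment 0 in Figure 3
largest_value = to_real((1 << 15) - 1)       # 64 - 2**-9
smallest_value = to_real(-(1 << 15))         # -64.0
```

Here slope_bits evaluates to 0000011.000100101, which agrees with the first c1 entry of Figure 3.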


[Plot: NON-UNIFORM f(x) segmentation. No. of segments = 7.]
Figure 4. Graphical Representation of Function and Approximation.


Figure 4 above shows the graph of the approximation produced by MATLAB. In
this figure, the approximation is superimposed on the actual function. Segments in the
approximation are shown as straight lines in alternating red and blue. Figure 5
below is a close-up view of the same figure where the 3rd and 4th segments meet. This
shows that the segmentation truly is an approximation. The red and blue lines at this
point are slightly higher than the actual function. Also, the two segments do not overlap
and, because the function is evaluated at discrete points, there is a slight gap between the
two segments.


[Plot: NON-UNIFORM f(x) segmentation. No. of segments = 7.]
Figure 5. Close-up of Approximation.


Figure 6 shows how the error between the approximation and the actual function
changes across the segmentation. Alternating curves of red and blue correspond to each
of the segments. As can be seen, the error does not exceed that specified by the user.
The numerical value of maximum error is displayed on the x-axis. Within each segment,
the maximum positive error and the maximum negative error are of equal magnitude. In
this case, the error starts negative; the difference between the actual function and the
approximation is negative, therefore the approximation is greater than (above) the
function. Where the error curve crosses the zero axis, the straight-line
approximation segment intersects the actual function. For this example, the segment (in
order from left to right) starts above the function, intersects the function, goes below the
function, intersects the function again and then finishes above the function.
[Plot: Error for NON-UNIFORM f(x) segmentation. No. of segs = 7.]
Figure 6. Error Across Approximation.


As previously discussed, there are two methods of approximating a function:
non-uniform and uniform segmentation. Each method has its own advantages and
disadvantages.


1. Non-Uniform Approximation Algorithm 


With non-uniform approximation, the width of each segment is chosen to be as
large as possible without exceeding the given error. In regions where the function is
nearly linear, the segments will be large in comparison to regions where the function's
curve changes rapidly, that is, where the magnitude of the second derivative (the rate of
change of the slope) is greatest. Non-uniform approximation results in fewer segments than
uniform segmentation. The disadvantage is that determining which segment
a particular input value belongs to is more complex and requires a segment
index encoder that must examine all of the input bits of x.


The MATLAB function multiplelinapprox.m (short for multiple line
approximation) performs the calculations for non-uniform approximation of the function.
Multiplelinapprox.m does this with the aid of varlinapprox.m (short for variable line
approximation), which it calls during its execution. Both of these M-files are shown in
Appendix A. The algorithm also depends on the MATLAB function polyfit.


The non-uniform approximation algorithm uses the following procedure, working
from left to right through the range of the input values to the function.

1. Beginning with two points, polyfit calculates the two coefficients (slope
   and intercept) for a first-order approximation of the given set of points.
2. The maximum error between the approximation and the actual function is
   determined.
3. The approximation is shifted vertically by half the distance of the
   maximum error, towards the actual function, so that it intersects the
   function somewhere besides its endpoints.
4. The error between the newly shifted approximation and the actual function is
   recalculated.
5. The error is checked against the user-specified maximum error to ensure it has
   not been exceeded.
6. If the maximum error has not been exceeded, the next adjacent point of the
   function is added and the process repeats with step 1.
7. If the maximum error has been exceeded, after many iterations, then the
   previous segment, for which the maximum error requirement was not
   exceeded, is used.
8. The endpoint of the given segment is recorded and the process restarts at
   the next given point.
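
The procedure above can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the thesis M-files: a hand-written least-squares line fit stands in for MATLAB's polyfit, the "intercept shift" is interpreted as centering the error band so that the largest positive and negative errors have equal magnitude, and the names fit_line, fit_and_shift, segment_nonuniform and evaluate are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Least-squares slope and intercept (a stand-in for polyfit of order 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:                      # a single point: use a flat line
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    c1 = sxy / sxx
    return c1, mean_y - c1 * mean_x


def fit_and_shift(xs, ys):
    """Fit a line, then shift its intercept so +/- errors are balanced."""
    c1, c0 = fit_line(xs, ys)
    errors = [y - (c1 * x + c0) for x, y in zip(xs, ys)]
    c0 += (max(errors) + min(errors)) / 2.0
    max_error = (max(errors) - min(errors)) / 2.0
    return c1, c0, max_error


def segment_nonuniform(f, lo, hi, eps, n_per_unit):
    """Greedy left-to-right segmentation; returns (endpoint, c1, c0) rows."""
    count = round((hi - lo) * n_per_unit)
    xs = [lo + k / n_per_unit for k in range(count)]
    ys = [f(x) for x in xs]
    segments, start = [], 0
    while start < count:
        end = min(start + 2, count)     # begin with two points
        best = fit_and_shift(xs[start:end], ys[start:end])
        while end < count:              # grow the segment one point at a time
            trial = fit_and_shift(xs[start:end + 1], ys[start:end + 1])
            if trial[2] > eps:
                break                   # keep the previous, acceptable fit
            best, end = trial, end + 1
        segments.append((xs[end - 1], best[0], best[1]))
        start = end
    return segments


def evaluate(segments, x):
    """Evaluate the piecewise-linear approximation at x."""
    for endpoint, c1, c0 in segments:
        if x <= endpoint:
            return c1 * x + c0
    _, c1, c0 = segments[-1]
    return c1 * x + c0


segments = segment_nonuniform(lambda x: math.sin(math.pi * x),
                              0.0, 0.5, 2 ** -9, 1000)
```

Because the grid and fitting details differ from the MATLAB routines, the endpoints produced by this sketch should not be expected to match Figure 3 digit for digit.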


2. Uniform Approximation Algorithm 


With uniform approximation, each segment is the same length. Therefore, each
segment is restricted to the size of the shortest segment. The algorithm must determine
the largest size that the shortest segment can have while still meeting the user's error
requirements. The function is then divided into segments of this length. The advantage
of uniform approximation is that the process of indexing which segment a particular input
belongs to becomes much easier. If the number of segments is a power of 2, then the
circuit only needs to examine log2(number of segments) bits of the input x in order to
determine what segment the input indexes. Therefore, when initially conducting the
approximation, after determining the number of segments to meet the error requirement,
it is best to increase to the next power of 2. The circuit must examine the same number
of bits, but the approximation will have a smaller error due to more segments. This will
be achieved with only a slightly higher memory requirement. For a first-order
approximation with 16-bit coefficients, the difference in memory is:

Δmemory = 2 × (seg_{power of 2} − seg_{uniform}) × 16 bits    (1)
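
A small Python sketch may make the indexing and memory arithmetic concrete. The function names are hypothetical, and the bit positions assume the 7.9 working format with the input restricted to [0, 0.5) and 16 segments, as in the sin(πx) example; the segment count of 9 is the 8-bit uniform entry for sin(πx) in Table 2.

```python
def next_power_of_two(n):
    """Smallest power of 2 that is greater than or equal to n."""
    power = 1
    while power < n:
        power *= 2
    return power


def extra_memory_bits(seg_uniform, coeff_bits=16):
    """Equation (1): two coefficients per added segment, coeff_bits each."""
    return 2 * (next_power_of_two(seg_uniform) - seg_uniform) * coeff_bits


def uniform_segment_index(code, index_bits, lsb_position):
    """Read index_bits bits of a fixed-point code, starting at lsb_position."""
    return (code >> lsb_position) & ((1 << index_bits) - 1)


# sin(pi*x): 9 uniform segments meet the 8-bit error target (Table 2),
# so rounding up to 16 segments costs 2 * (16 - 9) * 16 extra bits.
extra = extra_memory_bits(9)

# x = 63/512 in 7.9 format (code 63) lies in segment 3 when bits 4..7
# of the code are used as the index for 16 segments on [0, 0.5).
index = uniform_segment_index(63, index_bits=4, lsb_position=4)
```

For this example extra is 224 bits and index is 3. No comparison logic is needed for the uniform case, since the index is read directly from the input bits.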


Figure 7 shows the MATLAB output for the test function, sin(πx), defined for
0 ≤ x < 0.5, using uniform segmentation with 16 segments. As shown, the segments are
neatly distributed as defined by the 4 bits in bit positions 4 through 7 (rectangle with
dotted line).


UNIFORM Segmentation

Segment  End Point  End Point
Number   (Decimal)  (Binary)
0        0.031009   0000000.
1        0.062218   0000000.
2        0.093325   0000000.000101111
3        0.124437   0000000.000111111
4        0.155546   0000000.
5        0.166755   0000000.
6        0.217864   0000000.
7        0.248973   0000000.
8        0.280083   0000000.
9        0.311292   0000000.
10       0.342401   0000000.
11       0.373510   0000000.
12       0.404619   0000000.
13       0.435829   0000000.
14       0.466938   0000000.
15       0.498047   0000000.
Figure 7. MATLAB Output for Uniform Approximation. 


The MATLAB functions constantlinapprox.m and constantlinappxwerr.m 
perform the calculations for the uniform approximation of a function. The first function 
constantlinapprox.m (short for constant linear approximation) performs the calculations 
using a user-specified number of segments. The second function constantlinappxwerr.m 
(short for constant linear approximation with error) uses the maximum allowable error as 
the parameter to determine the segmentation. Both of these M-files are shown in 


Appendix A and call upon polyfit. 
The uniform approximation algorithm using constantlinapprox.m performs the
following:

1. Divide the length of the domain of the function by the number of
   user-defined segments.
2. Using polyfit, determine the slope and intercept of each segment.
3. For each segment, determine the maximum error between the segment and
   the function.
4. Shift the approximation segment vertically one-half the distance of the
   maximum error, towards the actual function, so that it intersects the
   function somewhere besides its endpoints.
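
The four steps above can be sketched in Python. As before, this is an illustration under stated assumptions rather than the constantlinapprox.m routine: a least-squares fit replaces polyfit, the vertical shift is interpreted as balancing the largest positive and negative errors, and the names fit_line and segment_uniform are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Least-squares slope and intercept (a stand-in for polyfit of order 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    c1 = sxy / sxx
    return c1, mean_y - c1 * mean_x


def segment_uniform(f, lo, hi, n_segments, n_per_unit=10000):
    """Return one (endpoint, c1, c0, max_error) row per equal-width segment."""
    width = (hi - lo) / n_segments                   # step 1
    points = max(2, round(width * n_per_unit))
    table = []
    for k in range(n_segments):
        left = lo + k * width
        xs = [left + width * j / points for j in range(points)]
        ys = [f(x) for x in xs]
        c1, c0 = fit_line(xs, ys)                    # step 2
        errors = [y - (c1 * x + c0) for x, y in zip(xs, ys)]
        max_error = (max(errors) - min(errors)) / 2  # step 3, after the shift
        c0 += (max(errors) + min(errors)) / 2        # step 4
        table.append((left + width, c1, c0, max_error))
    return table


def sin_pi(x):
    return math.sin(math.pi * x)


table16 = segment_uniform(sin_pi, 0.0, 0.5, 16)
table8 = segment_uniform(sin_pi, 0.0, 0.5, 8)
worst16 = max(row[3] for row in table16)
worst8 = max(row[3] for row in table8)
```

In this sketch, 16 segments keep the sampled error below 2^-9 while 8 segments do not, which is in line with the 8-bit uniform entry for sin(πx) in Table 2 lying between those two counts.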


The uniform approximation algorithm using constantlinappxwerr.m performs a
different procedure that is a hybrid between the two previously discussed procedures.
The procedure is as follows:

1. Determine at what point on the function the second derivative is the
   greatest [4].
2. From this point, move outward to the left and right by one point on the
   function.
3. Using polyfit, determine the slope and intercept of a linear approximation
   of the points.
4. Determine the maximum error between the linear segment and the actual
   function.
5. Shift the approximation segment one-half the distance of the maximum
   error, towards the actual function, so that it intersects the function
   somewhere besides its endpoints.
6. The error between the newly shifted approximation and the actual
   function is recalculated.
7. The error is checked against the user-specified maximum error to ensure it
   has not been exceeded.
8. If the maximum error has not been exceeded, the process repeats with step 2.
9. If the maximum error has been exceeded, then the previous segment, for
   which the maximum error requirement was not exceeded, is used.
10. Determine the number of segments by dividing the entire length of the
    function by the size of the segment created.
11. Using polyfit, determine the slope and intercept of each segment.
12. For each segment, determine the maximum error between the segment and
    the function.
13. Shift the approximation segment vertically one-half the distance of the
    maximum error, towards the actual function, so that it intersects the
    function somewhere besides its endpoints.


It is likely that a user would run both uniform approximation algorithms. First, 
the constantlinappxwerr.m would be used to provide the user an idea of how many 
segments are needed in order to meet the error requirements for a particular function. 
After obtaining this number, the user would run the constantlinapprox.m algorithm where 
the next power of 2 is provided as the number of segments desired for the approximation. 
In this way, both the error requirements are met and the indexing is simplified by only 


using a few bits to determine which segment to use. 


B. MATLAB RESULTS 


Table 2 below shows the number of segments required for 8-bit and 16-bit precision
for the given functions, as generated by MATLAB. It is interesting to note how many
segments are needed for different functions. Some functions, such as 2^x, are well suited
to a uniform segmentation implementation since the difference between uniform and
non-uniform is small: for 8-bit precision, non-uniform segmentation requires five segments
versus six for uniform segmentation. Other functions, such as √(-ln x), are much better
suited to non-uniform approximation. In this case, for 8-bit precision, non-uniform
approximation requires twelve segments where uniform approximation requires 145.


C. SUMMARY 


As shown, an important step in the NFG implementation process is the 
approximation. This is done by dividing the function into multiple, consecutive 
segments. The segments can either be of uniform or non-uniform length. Usually the 


approximation is developed to meet certain error requirements. 
After the function approximation is complete, the circuit can be accurately
described using the information gathered and incorporating it into an HDL. This process
will be described in detail in the next chapter.










































































                                                 Non-Uniform        Uniform
Function f(x)                    Interval        8-bit   16-bit     8-bit    16-bit
2^x                              [0, 1)          5       75         6        91
1/x                              [1, 2)          5       75         8        130
√x                               [0, 2)          10*     216        8,206    5.38 × 10^[?]*
1/√x                             [1, 2)          4       50         5        79
log2(x)                          [1, 2)          5       76         7        110
ln x                             [1, 2)          4       63         6        91
sin(πx)                          [0, 1/2)        7       109        9        144
cos(πx)                          [0, 1/2)        7       108        9        148
tan(πx)                          [0, 1/4)        5       73         9        144
√(-ln x)                         [1/256, 1/4)    12      216        145      2,507
tan²(πx) + 1                     [0, 1/4)        10      152        18       291
-(x log2 x + (1-x) log2(1-x))    (0, 1)          16*     342*       136*     34,787*
1/(1 + e^(-x))                   [0, 1)          2       21         2        28
(1/√(2π)) e^(-x²/2)              [0, √2]         4       43         5        81

* from Ref. [5].
Table 2. Number of Segments for Non-Uniform and Uniform Segmentation for 8
and 16-bit Precision.
III. NFG CIRCUIT


A. CIRCUIT OVERVIEW 


The next step in the process is to design the circuit using software design tools.
This is done using a combination of software tools and techniques, including Xilinx's
Integrated Software Environment (ISE) [9,10], Mentor Graphics' ModelSim [11],
Synplicity's Synplify [12,13] and standard IEEE behavioral VHDL [14,15,16]. The end
product is a behavioral description of a pipelined circuit written in VHDL, which is used
by the SRC computer for implementation on an FPGA.


The top-level circuit is built with Xilinx ISE’s Engineering Capture System (ECS) 
tool. The ECS is a schematic editor tool in which a circuit can be built visually by simply 
connecting parts together. Some of the parts are already available as Xilinx primitives, 


such as flip-flops, grounds and power (VCC). Other parts were custom built. 


[Schematic: Numeric Function Generator. 16-bit (7.8 in, 7.8 out, 7.9 working), 8-bit accuracy, signed, general NFG.]

Figure 8. NFG Top-Level Schematic. 


Figure 8 is the schematic of the NFG as built in Xilinx ECS. The schematic
shows the input, x, coming into the circuit at the top left. The input passes through the
slopeintlu (short for slope & intercept look-up) module that produces the slope and
intercept coefficients to be used by the rest of the circuit. The input, slope and intercept
are all clocked into registers. On the next clock, the slope and input, x, pass through a
signed multiplier. The product is then clocked into a register. The intercept is simply
passed on to the following register to be saved until it is needed. On the last clock, the
product and the intercept are added together to produce a signed sum. This is the value of
the function.
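
The timing of this pipeline can be illustrated with a short cycle-by-cycle simulation in Python. This is a behavioral sketch only: simulate_pipeline and the lookup argument are hypothetical names, floating-point values replace the fixed-point signals, and each loop iteration models one clock edge on which all registers update together.

```python
def simulate_pipeline(inputs, lookup):
    """Return the value captured at the output on each clock edge.

    lookup(x) must return (slope, intercept) for the segment containing x.
    None marks a register that does not yet hold valid data.
    """
    stage1 = None  # registers holding (x, slope, intercept)
    stage2 = None  # registers holding (product, intercept)
    outputs = []
    # Two extra clocks are appended so the last inputs drain from the pipe.
    for x in list(inputs) + [None, None]:
        # All next-state values are computed from the current registers.
        result = None if stage2 is None else stage2[0] + stage2[1]
        next2 = None if stage1 is None else (stage1[0] * stage1[1], stage1[2])
        next1 = None if x is None else (x, *lookup(x))
        outputs.append(result)
        stage1, stage2 = next1, next2
    return outputs


# A single-segment "function" f(x) = 2x + 1 keeps the example easy to follow.
trace = simulate_pipeline([0.0, 1.0, 2.0], lambda x: (2.0, 1.0))
```

The trace is [None, None, 1.0, 3.0, 5.0]: the first result appears on the third clock edge, and a new result follows on every clock after that.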


After constructing the circuit schematically, the VHDL code that describes the 
circuit can be extracted within the Xilinx ISE. This code will be used by the SRC 
reconfigurable computer. All VHDL code is shown in Appendix B. 


B. CIRCUIT COMPONENTS 


1. Slope and Intercept Look-up 


The heart of the NFG is the Slope and Intercept Look-up module. In this
implementation, the computation of the segment index and the subsequent output of the
slope and intercept coefficients are conducted in one module. This differs from the block
diagram shown in Figure 1, which shows separate processes for indexing and coefficient
output. This module is described behaviorally by VHDL code, which is automatically
generated by MATLAB during its function approximation algorithm (item 5 in the list of
MATLAB outputs in Chapter II). The VHDL code uses a set of If/Then/Else statements to
describe the module. For example, if the input value x is greater than 0.121224 but less
than 0.200940, then the slope equals 2.74354 and the intercept equals 0.04085 (see
Figure 3). By using behavioral VHDL, the FPGA synthesizer has maximum flexibility to
construct the actual circuit on the FPGA. The user can review the construction of the
circuit through various output files and reports created during synthesis to see how the
circuit was developed.


By simply replacing the Slope and Intercept Look-up module with another, a 


different function can be generated. 
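
The If/Then/Else look-up described above can be illustrated in software. The following Python fragment is only a sketch and is not the MATLAB-generated VHDL: it lists just the one segment quoted in the text, the names SEGMENTS, lookup and approximate are hypothetical, and it assumes the function being approximated is sin(πx).

```python
import math

# (lower bound, upper bound, slope, intercept). Only the segment quoted in the
# text is listed; a complete table would hold one entry per segment.
SEGMENTS = [
    (0.121224, 0.200940, 2.74354, 0.04085),
]


def lookup(x):
    """Return (slope, intercept) for the segment that contains x."""
    for low, high, slope, intercept in SEGMENTS:
        if low < x < high:
            return slope, intercept
    raise ValueError("x lies outside the segments listed in this sketch")


def approximate(x):
    """First-order approximation: slope * x + intercept."""
    slope, intercept = lookup(x)
    return slope * x + intercept


x = 0.16
print(approximate(x), math.sin(math.pi * x))
```

Exchanging the SEGMENTS table for one generated for another function changes the function being approximated without touching the rest of the code, which mirrors the module replacement described above.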


2. Multiplier


The 16 by 16 signed-arithmetic multiplier is also developed in behavioral VHDL. 
By using VHDL, the characteristics of the multiplier, in terms of bits used for input and 
output, can be explicitly specified. In this case, two 16-bit numbers are multiplied. 
Although this could potentially produce a 32-bit output, in this circuit a 16-bit output is 
produced by simply extracting the middle 16 bits of the product (bits 9 through 24). No 
rounding algorithm is used. The synthesizer will build the multiplier on the FPGA from 
this behavioral description using the resources available on the target chip.
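
The bit extraction performed by the multiplier can be modeled with integer arithmetic. The sketch below is an illustration in Python rather than the thesis VHDL; it assumes both operands are in the 7.9 working format described in the Number System section, and the function names are hypothetical.

```python
FRAC_BITS = 9  # 7.9 format: 9 bits to the right of the binary point


def to_signed16(value):
    """Interpret the low 16 bits of value as a two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def multiply_7_9(a, b):
    """Multiply two 7.9 integers and keep bits 9 through 24 of the product."""
    product = a * b                            # up to 32 bits, 14.18 format
    return to_signed16(product >> FRAC_BITS)   # bits 0-8 are dropped, no rounding


a = int(1.5 * 2**FRAC_BITS)     # 1.5 in 7.9 format
b = int(-2.25 * 2**FRAC_BITS)   # -2.25 in 7.9 format
print(multiply_7_9(a, b) / 2**FRAC_BITS)
```

Shifting right by 9 places and keeping 16 bits is the same as selecting bits 9 through 24, so the result is again a 7.9 number.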


3. Adder 


In a similar way, the 16 by 16, signed-arithmetic adder module is designed in 
behavioral VHDL. Two 16-bit numbers are added together. The sum is the output of the 
circuit, which is also the value of the function. The value of the function is expressed in a 
15-bit, signed, two’s complement, fixed-point format. The 15-bit representation is 
produced by removing the LSB of the 16-bit sum. 

The additional components in the circuit, specifically the FD16CE (16-Bit 
Data Registers with Clock Enable and Asynchronous Clear), ground and VCC (power), 
are standard IEEE components.


4. Number System 


The circuit was designed so it could be used for as many of the target functions 
shown in Table 1 as possible. Therefore, the design uses a 15-bit, fixed-point format for 
input and output. The 15 bits are distributed with 7 bits to the left of the decimal point 
and 8 bits to the right. This provides 8-bit accuracy (2^-8 is the lowest resolution which 
can be represented). In order to improve the accuracy of the circuit, 16 bits are used for 
the working calculations in a 7.9 format (2^-9 resolution). A signed 16-bit number in 7.9 
format can represent values in the range -64 ≤ x ≤ 63.998 (64 minus 2^-9).
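
The two formats can be checked with a few lines of integer arithmetic. The Python sketch below is illustrative only: the helper names are hypothetical, and rounding to the nearest step when converting a real value is an assumption, since the text does not state which conversion rule the tools apply.

```python
import math

INPUT_FRAC_BITS = 8    # 7.8 format used for input and output
WORK_FRAC_BITS = 9     # 7.9 format used for the working calculations


def to_fixed(value, frac_bits):
    """Scale a real value to an integer, rounding to the nearest step (assumed)."""
    return math.floor(value * 2**frac_bits + 0.5)


def from_fixed(raw, frac_bits):
    """Convert a scaled integer back to the real value it represents."""
    return raw / 2**frac_bits


def drop_lsb(raw_7_9):
    """Reduce a 16-bit 7.9 value to the 15-bit 7.8 output by removing the LSB."""
    return raw_7_9 >> 1


raw = to_fixed(0.2, WORK_FRAC_BITS)
print(raw, from_fixed(raw, WORK_FRAC_BITS))
print(from_fixed(drop_lsb(raw), INPUT_FRAC_BITS))
# Smallest and largest values of a signed 16-bit number in 7.9 format.
print(from_fixed(-2**15, WORK_FRAC_BITS), from_fixed(2**15 - 1, WORK_FRAC_BITS))
```

The last line reproduces the limits quoted above: the largest value is 64 minus one step of 2^-9.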


C. SUMMARY


As shown, the design of the circuit is done mainly using the Xilinx ISE software. 
The top-level circuit is constructed in the Xilinx ECS by laying out the circuit 
schematically. Individual components are either readily available as part of the standard 
Xilinx primitives or are custom designed in VHDL.


The ModelSim software package allows the designer to verify the operation of the 
circuit. ModelSim integrates with Xilinx ISE. The designer can create test bench 
waveforms to input into the circuit. Various outputs can be placed strategically 
throughout the circuit to verify that the proper signals are being passed. When the circuit 
is not operating properly, this is a great troubleshooting tool. The designer can narrow 
the search down to specific components in order to determine where the fault exists.


After verifying the proper operation of the circuit, the VHDL code is ready to be 
transferred to the SRC computer for circuit implementation. The SRC uses Synplify 
[12,13] for its synthesis tool. A good engineering practice is to first synthesize the 
VHDL code in a stand-alone version of Synplify. This is because, when designing a 
circuit in the Xilinx ISE, the Xilinx XST synthesizer is used to verify the circuit 
construction. There have been some differences noted between the two synthesizers. 
Software code which works in one may not work in the other. This anomaly is further 
discussed in Appendix E, Lessons Learned. Therefore, it is a good design check to 
ensure that the code works with the synthesizer that the SRC uses. 

Chapter IV will go into greater detail on how to finish implementing the circuit on 
the SRC and its FPGA.


IV. SRC IMPLEMENTATION


SRC Computers of Colorado Springs, CO was founded by the legendary Seymour 
R. Cray, who lends his initials to the company’s name. SRC developed the SRC-6E 
reconfigurable computer that is the target architecture for the construction of the NFG. 

The SRC-6E computer at NPS provides a unique architecture where an Intel-based 
PC is interfaced with SRC’s proprietary MAP processing boards. The MAP is the 
heart of the system, where the FPGAs reside. On the MAP are three Xilinx XC2V6000 
FPGAs and dual-ported memory. Only two of the FPGAs can be programmed; the other 
FPGA performs control functions. It is through this unique architecture and its associated 
interface that the NFG is implemented on the FPGA. More information on the SRC 
computer is available in Ref [17]. More information on the Xilinx XC2V6000 FPGA is 
available in Ref [18].


A. SOFTWARE CODE 


The SRC system develops a pipelined circuit on the FPGA. It has the ability to 
implement the design from software code written in C, FORTRAN, VHDL or Verilog. 
Along with this code, there are a few SRC-specific files which must be included in order 
to synthesize the design on the FPGA. 

Specific files which must be provided (see Appendix C) in the project directory 
include:


1. main.c 


This is the main routine, written in C, which runs on the SRC’s Intel processor. 
This routine is used to interact with the user and the .mc subroutine which runs on the 
MAP processor board.


2. <subroutine>.mc


Called by main.c, this subroutine, also written in C, executes on the MAP’s 
FPGAs to perform a certain function. In order to take advantage of the benefits of an 
FPGA, this is where the programmer would place computation-intensive portions of an 
algorithm. Results of the computations are returned to main.c.


3. makefile 


Commonly used in C/C++ programming, the makefile tells the compiler which 
files to use when compiling. A standard makefile is provided by SRC and is simply 
modified to accommodate the programmer’s unique code.


4. Macros 


Macros allow a programmer to more explicitly design a function on the FPGA. 
Macros are called by the .mc files and are typically written in an HDL, such as VHDL or 
Verilog. By writing in one of these languages, the programmer can manipulate the circuit 
down to the individual bit level. In this way, operations can be done on any combination 
of bits. Also, the bits can be combined or split up as necessary. Unlike C programs, 
macros must be manually pipelined.


In order to pipeline a circuit, the programmer must place registers between 
functional modules. As shown in Figure 8, 16-bit registers are inserted between the 
slope & intercept look-up, multiplier and adder modules. Pipelining a circuit requires the 
programmer to determine how much work can be done in one clock cycle and then place 
a register to store the results until the next clock cycle. Pipelining allows 
subsequent calculations to occur simultaneously, as needed results are held until they are 
used. Therefore, even though a calculation may take several clock cycles, once the 
pipeline is full, a result will be produced each clock cycle. Although a result may be 
produced every clock cycle, it usually takes more than one clock cycle to do a complete 
computation from beginning to end. The number of clock cycles this takes is known as 
latency. 

In Figure 8, the pipeline is three clocks deep. Therefore, its latency is 3. A result 
is produced at the output every clock cycle.
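
The behavior of a three-stage pipeline can be illustrated with a small clock-by-clock simulation. The Python sketch below is a software model only, not the circuit of Figure 8: the lookup function is a hypothetical stand-in that returns the same slope and intercept for every input, and floating point values are used in place of the 16-bit registers.

```python
def lookup(x):
    """Hypothetical stand-in for the slope & intercept look-up module."""
    return 2.0, 1.0  # (slope, intercept), the same for every x in this sketch


def run_pipeline(inputs):
    """Clock three register stages once per input and return (clock, result) pairs."""
    stage1 = stage2 = stage3 = None       # register contents, empty at reset
    outputs = []
    stream = list(inputs) + [None] * 3    # extra clocks to drain the pipeline
    for clock, x in enumerate(stream, start=1):
        # All registers load on the same clock edge, so update back to front.
        stage3 = None if stage2 is None else stage2[0] + stage2[1]
        stage2 = None if stage1 is None else (stage1[0] * stage1[1], stage1[2])
        stage1 = None if x is None else (x, *lookup(x))
        if stage3 is not None:
            outputs.append((clock, stage3))
    return outputs


for clock, result in run_pipeline([1.0, 2.0, 3.0]):
    print(clock, result)
```

In this model the first input is applied on clock 1 and its result is registered on clock 3; after that, one result appears on every clock.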


The following files are needed when implementing a macro: 


a. info 


The info file specifies the characteristics of the macro. It tells the compiler 
whether the macro is pipelined and specifies its latency, among other parameters. 


b. blk.v 


The blk.v file, commonly known as the “black box” file, specifies the macro 
interface. It describes the inputs and outputs (to include bit width) to and from the macro.


c. HDL Files


Written either in VHDL or Verilog, the HDL files describe the operation 
of the circuit. In VHDL, the circuit can be described with behavioral, dataflow or 
structural modeling. All the files in use must be listed in the makefile. They will have a 
suffix appropriate to the language so that the synthesizer knows how to interpret them, 
such as ‘.v’ for Verilog, ‘.vhd’ for VHDL and ‘.c’ for C programming.


For the NFG circuit, the VHDL files discussed in Chapter III are provided 
to the SRC macro. These files behaviorally describe how the circuit should operate. The 
SRC synthesis tool, Synplify by Synplicity, is the tool which translates the behavioral 
VHDL code into the actual implementation on the FPGA. 


B. SUMMARY


As shown, in order to implement the NFG on the SRC’s FPGA, the user must 
provide all of the necessary information to the SRC system. This information is in the 
form of various files that provide the details of the circuit. Once all of the information is 
provided, the SRC is able to construct a pipelined circuit on the FPGAs resident on the 
MAP processing board. 

The relatively easy user interface of the SRC allowed for various design 
implementations to be created on the system. These different designs yielded some 
interesting results, which are discussed in the next chapter.


V. IMPLEMENTATION RESULTS


Various implementations of the NFG were constructed and compared. Speed and 
latency were the dominant characteristics used for comparison. 


A. PC C++ IMPLEMENTATION


As a basis of comparison, an NFG was constructed using simple C++ code on an 
Intel PC. Specifically, f(x) = sin(πx) was realized. Such computations usually are in the 
form of a Taylor Series expansion or the CORDIC algorithm [1]. 

Because of the coarse timing information provided by the PC, it was necessary to 
calculate sin(πx) 100 million times (10^8). The time to perform these calculations is 
measured and then divided by the total number of calculations to compute a time per 
calculation. In this way, the average time per calculation of sin(πx) was determined 
to be approximately 130 nanoseconds.¹
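
The averaging method described above can be sketched as follows. This Python fragment is a stand-in for the C++ routine, which is not reproduced here; the function name and the choice of input values are assumptions, and the measured figure includes loop overhead, so it is not comparable to the 130 nanoseconds reported for the PC.

```python
import math
import time


def average_call_time(n_calls):
    """Time n_calls evaluations of sin(pi * x) and return the average in seconds."""
    start = time.perf_counter()
    for i in range(n_calls):
        math.sin(math.pi * (i % 1000) / 1000.0)
    elapsed = time.perf_counter() - start
    return elapsed / n_calls


# A large call count averages out the coarse resolution of the timer.
print(average_call_time(100_000))
```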


This will serve as the baseline against which to compare the NFG’s performance as 
implemented on the SRC. Of course, CPUs are always increasing in performance, but 
FPGAs are improving as well. Also, the FPGA has several key features of which the 
programmer can take advantage. Namely, the FPGA is virtually a blank slate, which can 
implement any architecture up to the limitations of the resources on the FPGA chip. As 
we will see, parallelism and pipelining can be exploited to increase an architecture’s 
performance. 


B. SRC IMPLEMENTATION


The SRC system allows the programmer to choose among a number of different methods 
to implement a function. The system recognizes functions written in C and FORTRAN,² 
as well as macros written in HDLs, such as VHDL and Verilog. The system will take the 
software code written in any of these languages and translate it into a pipelined circuit on 
the MAP hardware’s FPGA processor. During compilation, the SRC system provides the 
user with several reports. 

¹ C++ routine performed using Microsoft Visual Studio 2005 [19] on a Toshiba Satellite M30X laptop 
with an Intel Celeron M processor, 1.30 GHz, 768 MB RAM. 

² The FORTRAN compiler is not available on the NPS SRC-6E.


Among the most useful are the Inner Loop Summary and the Place and Route 
Summary. These provide the programmer with considerable insight into the workings of 
the circuit. 

The Inner Loop Summary has only 3 output lines, listed below (not in order):


1. Pipeline depth indicates how deep the pipeline is for a particular loop. 
Therefore, it is also an indication of latency for that loop. If an input is applied to the 
loop at time t, the result of that loop will be available at time t + pipeline depth. 

2. Clocks per iteration indicates how many clocks elapse between each 
successive output. In most cases, once the pipeline is full, each successive iteration will 
come out one clock later with a delay of pipeline depth from when its input was applied. 
Nevertheless, there are cases where there may be two or more clocks per iteration. In 
these cases, the programmer will most likely want to adjust the program to prevent this. 

3. Loop on line ‘n’ indicates to which loop in the program the report 
applies. If the program has multiple loops, each loop will have its own Inner Loop 
Summary. 

The Inner Loop Summary is typically an accurate indication of how fast a 
program will run.
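
The two figures reported in the Inner Loop Summary combine into a simple estimate of how many clocks a loop needs. The Python sketch below follows directly from the descriptions above (first result after pipeline depth clocks, each later result clocks per iteration clocks after the previous one); the function name is hypothetical, and the estimate assumes the loop never stalls.

```python
def clocks_to_finish(iterations, pipeline_depth, clocks_per_iteration=1):
    """Estimate the clocks needed for the last result of a pipelined loop to appear."""
    if iterations < 1:
        return 0
    # The first result takes pipeline_depth clocks; every further iteration
    # adds clocks_per_iteration clocks.
    return pipeline_depth + (iterations - 1) * clocks_per_iteration


print(clocks_to_finish(1, pipeline_depth=3))
print(clocks_to_finish(100, pipeline_depth=3))
print(clocks_to_finish(100, pipeline_depth=3, clocks_per_iteration=2))
```

For long loops the pipeline depth becomes negligible and the clocks per iteration dominates, which is why a value of two or more is worth removing.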


The other very useful report produced by the SRC compiler is the Place and 
Route Summary. This report provides the following information: 

1. Number of Slice Flip Flops used, usually expressed as a number ‘n’ out 
of ‘m’ and a percentage. These are 1-bit flip-flops that are resident on the FPGA 
within the slices. There are two flip-flops per slice. For the Xilinx Virtex-2 
XC2V6000, m equals 67,584. 

2. Number of 4 input LUTs used, also expressed as a number ‘n’ out of 
‘m’ and a percentage. Each LUT can realize any 4-variable logic function. There 
are two LUTs per slice for a total of 67,584 (=m). 

3. Number of occupied Slices used, expressed as a number ‘n’ out of ‘m’ 
and a percentage. A slice is the basic unit in the FPGA. Each slice has two 
flip-flops and two LUTs, in addition to other logic. For the Xilinx Virtex-2 
XC2V6000, m equals 33,792. 

4. Number of MULT18X18s used, expressed as a number ‘n’ out of ‘m’ 
and a percentage. These are high-speed multipliers resident on the chip. The 
XC2V6000 has 144 18 by 18 signed multipliers. 

5. Freq (short for frequency) indicates at what speed the FPGA will 
operate. Frequency is determined by the synthesizer and varies depending upon 
the structure of the circuit; the target speed is 100 MHz. 

The Place and Route Summary thus indicates how much of the FPGA is occupied 
by the circuit and how fast the circuit will operate.


1. SRC C Code Implementation 


a. Use of C Library Functions 


The FPGAs on the SRC can be programmed using C, FORTRAN or an 
HDL, either Verilog or VHDL. To compare with an ordinary C program running on a PC 
as described above, we also implemented the sin(πx) function in the SRC’s FPGAs using 
the ‘sin’ function in the SRC standard library (libmap.h). To test this program, random 
values of ‘x’ were generated and sent to the MAP processor where the test function 
sin(πx) was computed. The Inner Loop Summary report for this implementation specified 
a pipeline depth of 104 clocks with one clock per iteration. 

To further understand the circuit, this implementation was also modified 
by removing the π in sin(πx) so that only sin(x) was computed. In this case, 
the pipeline depth was reduced to 89 clock cycles. This suggests that 15 of the 104 
clock cycles for sin(πx) are solely for the floating point multiplication of π times x. 

All of the SRC source code is in Appendix C.


b. If, Then, Else Implementation (Floating Point) 


The next NFG implementation on the SRC computer attempts to 
approximate the hardware as depicted in Figure 1. But, instead of describing the 
hardware with VHDL, it is described with C code. The segment index encoder and 
coefficient look-up are realized using “if, then, else” statements where all the values are 
floating point numbers. The slope coefficient is multiplied by ‘x’ and the intercept 
coefficient is added to this product. The multiplication and addition operations are C 
standard library functions. The Inner Loop Summary report for this implementation 
displayed a pipeline depth of 40 clocks with one clock per iteration. The domain of the 
input was 0 < x < 0.5, and a 7-segment approximation was used. 

Therefore, by simply shifting from a library C function which is based 
upon the CORDIC algorithm [20] to an ‘if, then, else’ implementation, a 61% savings in 
clock cycles is realized. Of course, this time savings comes at the cost of a restricted 
input domain and some error in the approximation.


For this implementation, a follow-on test was performed where the slope and 
intercept coefficients were simply set to static values, thus removing the segment index 
and coefficient look-up from the circuit. In this case, the pipeline depth is reduced from 
40 clocks to 36 clocks. If the circuit is further reduced to just a multiplication of slope 
and input, the pipeline depth is reduced from 36 down to 22 clock cycles. And, if the 
circuit is reduced to just an addition of input and intercept without any multiplication, the 
pipeline depth is reduced from 36 to 24 clock cycles. 

It is interesting to note that multiplication takes fewer clock cycles (22 
clocks) than the addition process (24 clocks). This is most likely due to the fact that the 
FPGA has dedicated 18 by 18 multipliers on the chip while it does not have dedicated 
adders. Therefore, any adders needed in an implementation must be constructed with 
logic and carry chains. Also, it is interesting to see that the time needed to perform the 
one-segment implementation (36 clock cycles) does not equal the sum of its parts (the 
multiplier with 22 clocks and the adder with 24 clocks). This seems to suggest that the 
SRC synthesizer is building the circuit with some of the multiplying and adding in 
parallel, thus saving some clock cycles.


c. If, Then, Else Implementation (Fixed Point)


The next NFG implementation takes the approximation of the hardware in 
Figure 1 one step further. The VHDL implementation discussed in Chapter III uses 
fixed-point binary numbers. Therefore, an implementation in C code using fixed-point 
numbers was the next appropriate step. The SRC C code in Appendix C for the segment 
index encoder and coefficient lookup uses fixed-point numbers represented in 
hexadecimal. These numbers are essentially integers, since fixed-point numbers are 
simply integers that have been scaled. It is up to the user to interpret the numbers 
correctly by applying the correct scaling factor. In this case, a scaling factor of 2° is 


used. 


The Inner Loop Summary report for this implementation displays a 
pipeline depth of 18 clocks with one clock per iteration. This was for the same domain 
of the input, 0 < x < 0.5, with a 7-segment approximation. 

Thus, by shifting from floating point to fixed-point numbers in the ‘if, 
then, else’ implementation, a 55% savings in clock cycles is realized.


2. SRC Macro VHDL 


The final NFG implementation is an SRC macro implementation using the VHDL 
source code as described in Chapter III. In this case, the programmer has more control 
over how the SRC synthesizer constructs the NFG circuit. The VHDL code provides a 
more explicit, behavioral description of how the circuit should work. 

The following is an excerpt from Ref [17] which explains the types of macros 
which are available for use on the SRC.


Macros can be categorized by various criteria, and the compiler treats them in 
different ways based on their characteristics. In the MAP compiler, five characteristics 
are particularly relevant: 

• A macro is "stateful" if the results it computes are dependent upon previous 
things it has computed or seen. A simple example is a macro that sums the values that 
are arriving in sequence at its input. In contrast, a "non-stateful" macro computes values 
using only its current inputs; it has no "memory" of its past computations. 

• A macro is "external" if it interacts with parts of the system beyond the code 
block in which it lives. For example, a macro referencing a bank of OBM, or a macro 
that reads/writes a control processor register. The distinction is important in 
determining what things can happen in parallel. Since the effects of executing two 
external macros may be affected by the order in which they are executed, any call to an 
external macro is isolated into a unique block of code. Code blocks are executed 
sequentially; thus, the two external macros are executed in a deterministic order. 

• "Latency" is the number of clock cycles required between the time that a macro 
is activated with data until valid results appear. Some macros may not have a fixed 
latency. For example, a macro that waits for a flag register to go high will have an 
unpredictable wait. Since the pipelined inner loops generated by the MAP C compiler 
use fixed delay queues to balance the paths through the loop, all macros for inner loops 
must have a fixed latency. 

• A "pipelined" macro is able to accept new data values on its inputs while it is still 
internally processing the results from previous input values. A "fully pipelined" macro 
can accept new inputs on every clock. 

• A "periodic-input" macro is one that cannot take new inputs on every clock, but 
rather can take inputs at regular intervals. For example, a macro might be able to take 
new inputs every three clocks. In that case, its "period" is three. 

Five types of user macros can be used by the MAP compiler: Pure Functional, 
Pure Functional Periodic, Stateful, Stateful Periodic, and External. The chart below 
shows their characteristics:


Figure 9. Types of Macros and their Characteristics (Ref. 2).


The macro characteristics are designated in the info file. For the NFG VHDL 
macro implementation, the following were used: 

STATEFUL = NO 
EXTERNAL = NO 
PIPELINED = YES 
LATENCY = 2


This makes the NFG a Pure Functional macro. 


Upon synthesizing the NFG VHDL macro implementation, the Inner Loop 
Summary report for the NFG displayed a pipeline depth reduction to 12 clock cycles with 
still only one clock per iteration. For macros, it is up to the programmer to manually 
pipeline the circuit in the circuit design and then to set the latency for use by the SRC 
compiler in the info file. When synthesizing this circuit, the SRC synthesizer will 
automatically pipeline the rest of the circuit. 

During synthesis, the SRC system creates a pipeline that feeds into the macro as 
well as a pipeline that collects the return values from the macro. This accounts for the 
difference between the Inner Loop Summary report, which gives a value of 12 for the 
pipeline depth, and Figure 8, which shows a pipeline depth of 3. The SRC system has 
added a 10-deep pipeline overhead to the macro circuit. For this configuration, the macro 
has a 5-deep pipeline input and a 5-deep pipeline output. The reason the pipeline depth is 
not 13 (circuit pipeline of 3 plus 10-clock pipeline overhead) is that the programmer 
must indicate a latency of 2 in the info file for the inputs and outputs of this circuit to 
match up. The last stage of this 3-deep pipeline circuit is part of the output pipeline created 
by the SRC.


If the programmer erroneously specifies the latency as 6, for example, in the info 
file, the Inner Loop Summary will output a pipeline depth of 16. In fact, the Inner Loop 
Summary will always output the value for latency in the info file plus 10. The way a 
programmer might know that they have entered an incorrect value for latency is that 
when a number of successive inputs are run through the macro, the outputs will not 
match, but may be one or two outputs off. For example, in a long string of computations, 
the output for the current input may not be correct, but may match up quite nicely with an 
input before, after or nearby. When entering a latency in the info file, the designer should 
use a value that is one less than the pipeline depth that is shown in the circuit.
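
The two rules of thumb stated above can be written down directly. The Python sketch below is illustrative; the constant and function names are hypothetical, and the overhead of 10 clocks is the value observed for this configuration rather than a documented property of the SRC compiler.

```python
SRC_PIPELINE_OVERHEAD = 10  # clocks added around the macro, as observed above


def info_latency(circuit_pipeline_depth):
    """Latency to enter in the info file: one less than the circuit's pipeline depth."""
    return circuit_pipeline_depth - 1


def reported_pipeline_depth(latency):
    """Pipeline depth the Inner Loop Summary reports for a given info-file latency."""
    return latency + SRC_PIPELINE_OVERHEAD


latency = info_latency(3)                     # the circuit in Figure 8 is 3 deep
print(latency, reported_pipeline_depth(latency))
print(reported_pipeline_depth(6))             # the erroneous latency discussed above
```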


A summary of these implementations is in Table 3 below.


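f(x) = sin(πx) 

Code | Method           | Segmentation | Num of Segs | Pipeline Depth | Clks per Iteration | Type 
C    | Library function |              |             | 104*           | 1                  | Floating point 
C    | If, then, else   |              | 7           | 40             | 1                  | Floating point 
C    | If, then, else   |              | 7           | 18             | 1                  | Fixed point 
VHDL | Macro            |              |             | 12             | 1                  | Fixed point 

*89 when π is omitted 

Table 3. Summary of Implementations.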


When comparing these implementations, it is important to note that the 
first implementation works on floating point numbers over a large domain. The rest of 
the implementations work only over the limited input domain of 0 < x < 0.5. And in the 
last two cases, the implementations use only fixed-point numbers. Therefore, the savings 
in pipeline depth and its associated latency are not realized without drawbacks.
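
The savings quoted in this chapter follow from the reported pipeline depths. The short Python sketch below recomputes them as a check; the function name is hypothetical and the depths are the values given in the text.

```python
def savings_percent(before, after):
    """Percentage reduction in pipeline depth when going from before to after."""
    return 100.0 * (before - after) / before


print(savings_percent(104, 40))   # library function to floating point if/then/else
print(savings_percent(40, 18))    # floating point to fixed point if/then/else
print(savings_percent(104, 12))   # library function to VHDL macro
```

The last figure compares the first and the final implementation and is the overall reduction in latency.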


In all of the SRC implementations, once the pipeline is full, a new result is 
produced each clock cycle. Running at a nominal 100 MHz, the SRC produces an output 
every 10 nanoseconds. Compared to the PC C++ implementation, where the average 
time per calculation was approximately 130 nanoseconds, this represents a 13-fold 
increase in performance.


C. ASIC IMPLEMENTATION 


If the circuit described in Chapter III were implemented and placed on a 
free-standing FPGA or Application-Specific Integrated Circuit (ASIC), then the 10-clock 
cycle overhead which comes from the nature of SRC’s computer system would be 
eliminated. A binary value could be applied to the input pins of the circuit, along with a 
clock signal, and an output could be received three clock cycles later. Of course, without 
the SRC architecture, the ability to build and re-build the circuit for alternate functions 
becomes more difficult. Placing the circuit on an FPGA or ASIC would be best once 
development is complete and for a user that does not need the added flexibility and ease 
of use of the SRC.


D. FPGA RESOURCES 


As previously discussed, the Place and Route Summary is also a very useful 
report when compiling software on the SRC. This gives the developer an idea of how 
much of the FPGA is being used, and how much is still available. Table 4 below 
summarizes the amount of the FPGA chip used by the final implementation. 


Code       | Method | Segmentation | Slice FF  | LUTs      | Occ Slices | Mult18x18 | Freq (MHz) 
VHDL       |        |              | 4058 (6%) | 1530 (2%) | 2425 (7%)  | 1 (1%)    | 90.7 
FPGA Total |        |              | 67584     | 67584     | 33792      | 144       | 

Table 4. Summary of FPGA Resources Used.


Table 4 shows that for the final implementation only approximately 7% of the 
FPGA chip is used. This leaves considerable space to add other components, possibly other 
NFGs, to the overall circuit.


E. COMPUTATION RESULTS 


Random values were generated and fed into the SRC macro by the main.c and 
sine.mc routines. The results of the computations performed by the NFG were compared 
with values computed by Microsoft (MS) Excel. Table 6 provides a summary of these 
results. The table shows the value of x and the resulting value of f(x) as computed on the 
MAP. This value is then compared to a value computed by Excel (algorithm unknown). 
The last two columns show the difference between Excel and the NFG and quantify that 
difference in terms of the Basic Unit of Accuracy (BUA). One BUA is equivalent to the 
value of the LSB (2^-8) of the output number. Table 5 shows how a difference maps to a 
number of BUAs. 


BUAs | Difference, Δ 
0    | Δ < 1 BUA 
1    | 1 BUA ≤ Δ < 2 BUAs 
2    | 2 BUAs ≤ Δ < 3 BUAs 
3    | 3 BUAs ≤ Δ < 4 BUAs 
4    | 4 BUAs ≤ Δ < 5 BUAs 
5    | 5 BUAs ≤ Δ < 6 BUAs 

Table 5. Difference vs Bit Equivalency
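
The mapping in Table 5 amounts to dividing the absolute difference by one BUA and discarding the fraction. The Python sketch below illustrates this; the function name is hypothetical, one BUA is taken as 2^-8 as stated above, and the example values are read from one row of Table 6.

```python
import math

BUA = 2 ** -8  # value of the LSB of the 15-bit output, 0.00390625


def bua_count(nfg_value, reference_value):
    """Number of whole BUAs by which the NFG output differs from the reference."""
    return math.floor(abs(nfg_value - reference_value) / BUA)


# NFG output 0.550781 against the Excel value 0.5555702.
print(bua_count(0.550781, 0.5555702))
```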


1 BUA = 2^-8 = 0.003906 

f(x) = sin(πx) 

x (Hex) | x (Decimal) | Excel f(x) (Hex) | Excel f(x) (Decimal) | NFG f(x)[14:0] (Hex) | NFG f(x)[14:0] (Decimal) | BUA | Difference 
10      | 0.0625      | 31               | 0.1950903            | 32                   | 0.195313                 | 0   | 0.000222 
20      | 0.125       | 61               | 0.3826834            | 62                   | 0.382813                 | 0   | 0.000129 
30      | 0.1875      | 8E               | 0.5555702            | 8F                   | 0.558594                 | 0   | 0.003024 
40      | 0.25        | B5               | 0.7071068            | B6                   | 0.710938                 | 0   | 0.003831 
        | 1.691406    | 7F2C             | -0.8245893           | 7F2B                 | -0.832031                |     | 0.007442 
        | 0.8125      | 8E               | 0.5555702            | 8D                   | 0.550781                 | 1   | 0.004789 
        |             |                  | 0.0245412            | 5                    | 0.019531                 |     | 0.00501 
        |             |                  | 0.953306             | F3                   | 0.949219                 |     | 0.004087 

f(x) = √(-ln(x)) 

x (Hex) | x (Decimal) | Excel f(x) (Hex) | Excel f(x) (Decimal) | NFG f(x)[14:0] (Hex) | NFG f(x)[14:0] (Decimal) | BUA | Difference 
14      | 0.078125    | 198              | 1.5966982            | 198                  | 1.59375                  | 0   | 0.002948 
15      | 0.082031    | 194              | 1.5813459            | 194                  | 1.578125                 | 0   | 0.003221 
35      | 0.207031    | 141              | 1.2549444            | 140                  | 1.25                     | 1   | 0.004944 
31      | 0.191406    | 149              | 1.2858294            | 148                  | 1.28125                  | 1   | 0.004579 
01      | 0.003906    | 25A              | 2.35482              | 255                  | 2.332031                 | 5   | 0.022789 
19      | 0.097656    | 186              | 1.5252218            | 186                  | 1.523438                 | 0   | 0.001784 

Table 6. Examples of Output Results.





F. SOURCES OF ERROR 


As shown in Table 6 above, the value of the function as computed by the NFG 
and as computed by MS Excel does not always match. In all of the above cases, these two 
values differ by at most one BUA (2^-8, which is equivalent to 0.003906), with the 
exception of one case (√(-ln(x)) with an input of 1). This indicates that the NFG is 
computing f(x) properly. The difference between the NFG and Excel outputs can be 
attributed to several possible sources of error.


1. Function Approximation 


As shown in Chapter II, the NFG uses an approximation of the actual function to 
perform its calculations. Figure 5 shows that the approximation and the function do 
differ by up to some maximum error. This difference between the approximation and the 
actual function is a source of error in the circuit. 


2. Conversion from Decimal to Binary in MATLAB and Excel


Both MATLAB and Excel take floating-point numbers and convert them 
into binary, fixed-point numbers. Some error is going to be introduced, since most 
floating point numbers do not convert exactly, given the chosen accuracy. In this case, 
the conversion algorithm will need to round the floating-point number. This rounding is 
a source of error. 


3. Absence of Rounding in the Multiplier and Adder


In the NFG circuit, the complete results of the multiplier and the adder are not 
used. In the VHDL code that describes the multiplier, two 7.9 numbers (7 bits to the left 
and 9 bits to the right of the decimal point) are multiplied. This product will be in the 
form of a 32-bit (14.18) product. In order to convert the product back to a 7.9 number, 
only the middle 16 bits (bits 9 through 24) are used. The other bits are simply truncated. 
A more sophisticated circuit would use a rounding algorithm when removing the lower 
order bits. The rounding algorithm would look at the 10th bit to the right of the decimal 
point. If this bit were 0, then the rest of the bits would be truncated. If this bit were 1, 
then 1 would be added to the 7.9 number. A rounding algorithm should be one of the 
items for future work, as it would reduce this source of error. This same argument holds 
for the adder where there is a 16-bit output which is converted into a 15-bit output by 
truncating the LSB.
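
The rounding rule proposed above can be compared with plain truncation in a few lines. The Python sketch below is an illustration of that rule and is not part of the implemented circuit; the function names are hypothetical, and overflow of the 16-bit result is not handled.

```python
FRAC_BITS = 9  # 7.9 format: 9 bits to the right of the binary point


def multiply_truncate(a, b):
    """Keep the 7.9 part of the 14.18 product and discard the lower-order bits."""
    return (a * b) >> FRAC_BITS


def multiply_round(a, b):
    """As above, but add 1 when the 10th fractional bit of the product is set."""
    product = a * b
    result = product >> FRAC_BITS
    # Bit 8 of the 14.18 product is the 10th bit to the right of the binary point.
    if (product >> (FRAC_BITS - 1)) & 1:
        result += 1
    return result


a, b = 3, 300   # raw 7.9 integers; the exact product is 900 / 2**18
print(multiply_truncate(a, b), multiply_round(a, b))
```

Here the exact result lies closer to the rounded value than to the truncated one, which is the improvement a rounding stage would provide.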


4. Insufficient Bits 


As more bits are used to the right of the decimal place, numbers and arithmetic 
performed in binary can come closer to their real values. This is 
why the NFG uses an LSB of 2^-9 in the working portion of the circuit. Even more bits 
would reduce this source of error further.


VI. CONCLUSION


A. SUMMARY OF WORK 


In this thesis research, a high-speed NFG circuit was developed on a COTS, 
reconfigurable computer.


First, MATLAB routines were developed which will approximate a given 
function with successive first-order, linear segments. The MATLAB code will divide the 
function into either uniform (same length) or non-uniform (variable length) segments. In 
the uniform case, the user can specify whether to use a certain number of segments or to 
meet a desired maximum error constraint. In the non-uniform case, only the latter choice 
is available. MATLAB generates the approximation along with VHDL code that has the 
segment endpoint, slope and intercept information.


MATLAB was used to generate the VHDL in order to make the overall process easier
for the end user, since MATLAB can write directly to a file. In the NFG circuit, the only
module that distinguishes one function from another is the segment index encoder and
coefficient look-up. Therefore, a user who wants to implement a different function only
needs to run the MATLAB routine and retrieve the MATLAB-generated VHDL file. This is
much easier than manually inserting the MATLAB calculations into pre-formed VHDL code.


VHDL code was developed for the rest of the circuit and combined with the 
MATLAB-generated code to form the complete description of the circuit. 


The VHDL code was transferred to the SRC-6E computer system, where it was
placed into a macro. The SRC used the macro VHDL code, along with all of the other
required information files, to implement the NFG on the FPGA.


Several versions of the NFG were generated, both on and off the SRC, in C and
VHDL code for comparison.


The NFG implementations were compared in Chapter IV, as summarized in Table 3.
The table provides the parameters used in comparing the various NFG implementations:
pipeline depth and clocks per iteration. In the end, it is up to the customer to decide
which features are most important.


Pipelining is what makes the SRC so powerful. All of the SRC implementations
provide an output once every clock cycle, although the SRC clock speed of 100 MHz may
be considered relatively slow. When millions of calculations must be performed in
succession, this pipelined approach will probably be the quickest. If the latency between
input and output is important to the user, then one of the implementations that minimizes
latency should be considered. Each advantage comes with a disadvantage: the fastest
circuit (smallest latency) has the most limitations. It works only with fixed-point
numbers and over a limited domain of input values. Larger domains may be realized at
the expense of more segments and thus more memory.


On the SRC, the slowest implementation had a pipeline depth of 104 clock cycles.
The architecture of the NFG implementation reduced this to 12 clock cycles, a reduction
of more than 88% in this parameter ((104 - 12)/104 is approximately 88.5%). Additionally,
when the NFG was compared to a function implementation using C++ on a personal
computer, a performance increase of 92% was shown.


The NFG circuit can easily be reconfigured to generate alternate functions simply
by replacing the segment index and coefficient look-up portion of the circuit. Not only
can elementary functions be approximated, but the complexity of the function can
increase with no adverse effect on the circuit. For example, some of the functions in
Table 1, such as the entropy function $-x\log_2 x - (1-x)\log_2(1-x)$, would require
several computations on a general purpose CPU. That is, parts such as $\log_2 x$ and
$\log_2(1-x)$ are computed separately, multiplied by $x$ and $1-x$, respectively, and
then summed, whereas in the NFG circuit all of the computations are done in the same
circuit as for any other function.
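The point can be illustrated with a short software model. The Python sketch below is an assumption-laden illustration rather than the thesis code: it uses uniform segments and the chord through each segment's endpoints in place of the polyfit-based MATLAB routines, and the names `build_table` and `evaluate` are hypothetical. Once the table is built, the entropy function costs one look-up, one multiply and one add per input, exactly as any simpler function would.

```python
import math


def entropy(x):
    """Binary entropy -x*log2(x) - (1-x)*log2(1-x), taken as 0 at x = 0 and x = 1."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)


def build_table(func, x_low, x_high, segments):
    """One (slope, intercept) pair per uniform segment, from the segment's chord."""
    width = (x_high - x_low) / segments
    table = []
    for k in range(segments):
        x0 = x_low + k * width
        slope = (func(x0 + width) - func(x0)) / width
        table.append((slope, func(x0) - slope * x0))
    return table


def evaluate(table, x_low, x_high, x):
    """Look up the segment's coefficients, then do one multiply and one add."""
    segments = len(table)
    k = min(int((x - x_low) / (x_high - x_low) * segments), segments - 1)
    slope, intercept = table[k]
    return slope * x + intercept


table = build_table(entropy, 0.0, 1.0, 64)
print(evaluate(table, 0.0, 1.0, 0.3), entropy(0.3))
```

The per-input work in `evaluate` does not depend on how complicated `func` is; only the table contents change.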
B. SUGGESTED FUTURE WORK


1. Multiple NFGs in Parallel


One of the advantages of the FPGA implementations is the relatively small
portion of the FPGA chip that is consumed. The circuit can be made more complex, up to
the limit of the resources on the chip. One such implementation places several NFGs in
parallel on the chip. Each one separately calculates the result of its function from the
applied input. Another input, driving a selector, decides which of the outputs is desired.
In this way, several functions are implemented and the chip does not have to be
reprogrammed each time a different function is needed. This is very reasonable, as is
evident in Table 4, where the sample function, sin(πx), takes up only 7% of the FPGA.
































[Figure 10: the input x drives NFG 1, NFG 2 and NFG 3 in parallel; a selector controlled by sel passes one of their outputs to f(x).]
Figure 10. Multiple NFGs Working in Parallel.


Another variation on this theme would be a circuit with multiple segment index and
coefficient look-up modules but a single, shared multiplier and adder. In this circuit, a
selector would choose which segment index and coefficient look-up module to use based
upon which function needs to be calculated.
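A software model of this second variation is sketched below in Python. It is only an illustration under stated assumptions: the tables use uniform segments on [0, 1) with chord coefficients, the two functions are picked arbitrarily, and `TABLES`, `chord_table` and `nfg` are hypothetical names. The selector chooses a coefficient table, and the same multiply and add serve every function.

```python
import math


def chord_table(func, segments):
    """(slope, intercept) for uniform segments on [0, 1), from each segment's chord."""
    width = 1.0 / segments
    table = []
    for k in range(segments):
        x0 = k * width
        slope = (func(x0 + width) - func(x0)) / width
        table.append((slope, func(x0) - slope * x0))
    return table


# One coefficient table per selectable function (illustrative choices).
TABLES = {
    0: chord_table(lambda x: math.sin(math.pi * x), 16),
    1: chord_table(lambda x: 2.0 ** x, 16),
}


def nfg(sel, x):
    """sel picks the look-up table; the multiply and add below are shared."""
    table = TABLES[sel]
    k = min(int(x * len(table)), len(table) - 1)
    slope, intercept = table[k]
    return slope * x + intercept


print(nfg(0, 0.3), math.sin(math.pi * 0.3))
print(nfg(1, 0.3), 2.0 ** 0.3)
```

Only the tables are duplicated per function, which mirrors the idea of duplicating the look-up modules while keeping a single multiplier and adder.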


2. 16-bit and Higher Implementations


A similar circuit should be built that implements 16-, 32- and 64-bit versions of
the NFG. These higher accuracy circuits may be more in line with the needs of
potential users. Higher accuracy will mean smaller errors, more segments and more
FPGA resources being consumed. This will most likely require a shift in architecture
from the current segment index and coefficient look-up module, which uses "if/then"
statements, to one that is based on memory (see item 3 below).


3. Memory Vice If/Then 


Currently, the segment index and coefficient look-up occur in the same module,
which is built from a behavioral VHDL description using if, then, else statements. As
shown by the synthesizer reports produced by the SRC and Synplicity (see Appendices F
and G), this module is created using a chain of logic blocks (LUTs). This architecture
has its limitations. As the complexity of the circuit increases, due to more segments and
higher accuracy, the logic chains will get longer and so will the delay of the longest
path. This, in turn, will force slower clock frequencies in order to accommodate the
longer delay path. The solution to this problem is to split up the segment index and
coefficient look-up portions of the circuit.


The coefficient look-up module can be realized by programming memory that is
pre-loaded with the values calculated by MATLAB. There are Xilinx primitives for this,
such as RAMB16_S18_S18, which is a 16-kilobit block RAM (18 kilobits including parity)
with two 18-bit outputs. The RAM can be described in VHDL or Verilog, including the
initial values of the memory. MATLAB could write the VHDL code, similar to how it is
done now. The memory would only have to be re-programmed when implementing a
different function. Thus, the RAM is really just a ROM during NFG execution. One other
option, investigated but not yet working, would be to have MATLAB generate only the
initialization values for the RAM. These values would be placed in a separate file that
would be opened by the VHDL code that describes the RAM. This was tried
unsuccessfully but may be possible.


One drawback of this architecture is that the RAMB16_S18_S18 is a 16-kilobit
RAM. It is not certain what happens when the entire 16k is not used. Does the
synthesizer tie up only the FPGA resources that are needed, or does it use the entire
16k? Obviously, the former would be desired, since it would create a memory that can
expand or contract based upon the characteristics of the function.


Another drawback is that using memory for coefficient look-up solves only half
the problem. The current architecture uses one module for both segment index
encoding and coefficient look-up. When using memory, the segment index encoding
would have to occur in another, separate module. The memory simply holds the values
of the coefficients. The segment index encoder would provide the memory address of
the location to read.


As previously discussed, there are two ways to approximate a function: uniform
and non-uniform segmentation. In uniform segmentation, all segments are the same
length, and the number of segments should be a power of two. The segment indexing for
uniform approximation is simpler, since a certain number of the most significant input
bits ($\log_2$ of the number of segments) is simply used as the read address for the
memory.
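The uniform indexing scheme can be sketched in a few lines of Python. The word width, the segment count and the assumption that the function's domain spans the full range of the input word are all illustrative choices, not values taken from the NFG circuit.

```python
X_BITS = 16                              # width of the input word (assumed)
SEGMENTS = 16                            # number of uniform segments, a power of two
INDEX_BITS = SEGMENTS.bit_length() - 1   # log2(16) = 4 address bits


def segment_index(x_code):
    """Read address = the INDEX_BITS most significant bits of the input word."""
    return x_code >> (X_BITS - INDEX_BITS)


print(segment_index(0x0FFF), segment_index(0x1000), segment_index(0xFFFF))
```

No comparison against segment endpoints is needed, which is why the segment count is kept at a power of two.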


Segment indexing for non-uniform approximation is more difficult. Each
segment may be a different size, so there is no set pattern to follow to determine the
segment to which the input belongs. References [2] and [5] describe how to use a LUT
cascade to encode an input into a particular segment. What is not known is how much
delay the LUT cascade will add to the overall circuit.
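For comparison, the mapping that a non-uniform segment index encoder must compute can be written as a search over the sorted segment endpoints. The Python sketch below uses a binary search from the standard library. It is a software analogue only, not the LUT cascade of References [2] and [5], and the endpoint values are hypothetical.

```python
from bisect import bisect_right

# Hypothetical upper endpoints of five non-uniform segments on [0, 0.5].
SEGMENT_ENDS = [0.05, 0.12, 0.22, 0.35, 0.5]


def segment_index(x):
    """Index of the first segment whose upper endpoint exceeds x.

    Inputs at or beyond the last endpoint fall into the last segment,
    like the final else branch of an if/elsif chain.
    """
    return min(bisect_right(SEGMENT_ENDS, x), len(SEGMENT_ENDS) - 1)


print([segment_index(x) for x in (0.0, 0.05, 0.3, 0.5)])
```

A binary search needs about $\log_2$ of the number of segments comparisons, whereas a chain of if/elsif comparisons grows linearly with the number of segments.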


4. Uniform Approximation Increases to Next Power of 2 


It has already been shown that, when using uniform segmentation, it makes the
most sense to divide the function into a number of segments equal to a power of two.
The MATLAB code should be modified to do this automatically, rather than having the
user run the routine twice: once to determine the number of segments for a given error
and again to increase the segmentation to the next power of two.
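The rounding step itself is small. A minimal Python sketch is given below; the function name is hypothetical and the segment counts are examples only.

```python
def next_power_of_two(segments):
    """Smallest power of two that is greater than or equal to segments."""
    if segments < 1:
        raise ValueError("segment count must be at least 1")
    return 1 << (segments - 1).bit_length()


print(next_power_of_two(23))  # a count of 23 segments would be raised to 32
```

A count that is already a power of two is returned unchanged, so the step could be applied unconditionally after the error-driven segment count is found.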


5. Higher-Order Approximations


The use of higher-order approximations, such as second order
($y = c_2x^2 + c_1x + c_0$), may better approximate a function with fewer segments.
This would result in memory savings and a less complex segment indexer, although the
circuit would be more complex due to the addition of a multiplier (for the $c_2x^2$
term) and an adder. Additionally, the memory would have to provide three coefficients
vice two. Ref [21] is currently investigating higher-order approximations. Most likely,
the benefits of higher-order approximations will only be realized on specific (less
linear) functions.
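The trade-off can be illustrated on a single segment. The Python sketch below is an assumed example: it interpolates sin(πx) on [0, 0.125] with a line through the endpoints and with a parabola through the endpoints and midpoint, rather than using polyfit or the method of Ref [21]. The second-order form is evaluated with Horner's rule, which makes the extra multiplier and adder explicit.

```python
import math


def linear_coeffs(f, x0, x2):
    """Line through the segment endpoints: returns (c1, c0)."""
    c1 = (f(x2) - f(x0)) / (x2 - x0)
    return c1, f(x0) - c1 * x0


def quadratic_coeffs(f, x0, x2):
    """Parabola through the endpoints and the midpoint: returns (c2, c1, c0)."""
    x1 = 0.5 * (x0 + x2)
    d01 = (f(x1) - f(x0)) / (x1 - x0)
    d12 = (f(x2) - f(x1)) / (x2 - x1)
    c2 = (d12 - d01) / (x2 - x0)
    return c2, d01 - c2 * (x0 + x1), f(x0) - d01 * x0 + c2 * x0 * x1


def max_error(f, approx, x0, x2, samples=1000):
    """Largest absolute error on evenly spaced sample points of the segment."""
    points = (x0 + (x2 - x0) * k / samples for k in range(samples + 1))
    return max(abs(f(x) - approx(x)) for x in points)


def f(x):
    return math.sin(math.pi * x)


c1, c0 = linear_coeffs(f, 0.0, 0.125)
q2, q1, q0 = quadratic_coeffs(f, 0.0, 0.125)
linear_error = max_error(f, lambda x: c1 * x + c0, 0.0, 0.125)
# Horner's rule: two multiplies and two adds instead of one of each.
quadratic_error = max_error(f, lambda x: (q2 * x + q1) * x + q0, 0.0, 0.125)
print(linear_error, quadratic_error)
```

On this segment the second-order error is several times smaller than the first-order error, so the same error bound could be met with wider (and therefore fewer) segments.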


6. Different Architecture, $y = c_1(x-p) + c_0$ Circuit


Another architecture that has been considered is one whose characteristic
equation is $f(x) = c_1(x-p) + c_0$. In this case, a pivot point $p$ is subtracted from
the input $x$ prior to multiplication. This architecture may lead to a smaller multiplier
in the circuit, thus saving time and resources. However, since the Xilinx Virtex-2 has
18 by 18 multipliers resident on the chip, the synthesizer might just use the 18 by 18
multiplier anyhow. The circuit will need three coefficients provided from memory and
an additional subtractor. The advantages of this architecture need to be investigated.
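The relation between the two forms, and the reason the multiplier may shrink, can be sketched in Python as follows. The coefficient values and the helper names are hypothetical; the pivot is taken to be the start of the segment, as in the MATLAB routine of Appendix A.

```python
def slope_intercept(c1, c0, x):
    """Current architecture: y = c1*x + c0."""
    return c1 * x + c0


def to_pivot_intercept(c1, c0, p):
    """Intercept to store for the pivot form so both forms give the same line."""
    return c0 + c1 * p


def pivot_form(c1, c0_pivot, p, x):
    """Alternative architecture: y = c1*(x - p) + c0_pivot."""
    return c1 * (x - p) + c0_pivot


def offset_bits(segment_width_codes):
    """Bits needed to hold x - p when the segment spans this many input codes."""
    return (segment_width_codes - 1).bit_length()


c1, c0, p, x = 1.5, 0.25, 0.5, 0.625   # illustrative values
c0_pivot = to_pivot_intercept(c1, c0, p)
print(slope_intercept(c1, c0, x), pivot_form(c1, c0_pivot, p, x))
print(offset_bits(32))  # a segment 32 codes wide needs only a 5-bit offset
```

Because $x - p$ never exceeds the segment width, one multiplier operand needs far fewer bits than the full input word, which is the source of the possible savings.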


7. Use of Remez Algorithm for Segmentation


Upon examination of Figure 6, it is evident that the error at one endpoint of a
segment does not necessarily match the error at the other endpoint of the segment, nor
the error at the endpoint of the adjacent segment. The errors may be close, but they are
not exactly the same. This anomaly is most likely due to the use of the polyfit function
in the MATLAB segmentation routines. The polyfit function fits the given points so that
the error is minimized in a least squares sense, which is different from minimizing the
maximum error. The Remez algorithm, which may solve this anomaly, is being
investigated in Ref [21].
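The difference between the two criteria can be seen on one segment. The Python sketch below is an illustration under a simplifying assumption: the function is convex or concave on the segment, in which case the best maximum-error line has the slope of the chord and an intercept centred between the extreme deviations. It is not the Remez algorithm itself, and the function names are hypothetical.

```python
import math


def least_squares_line(xs, ys):
    """Closed-form least squares fit of a line: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def minimax_line(xs, ys):
    """Best max-error line for samples of a convex or concave function."""
    slope = (ys[-1] - ys[0]) / (xs[-1] - xs[0])
    residuals = [y - slope * x for x, y in zip(xs, ys)]
    return slope, 0.5 * (max(residuals) + min(residuals))


def signed_error(line, x, y):
    slope, intercept = line
    return y - (slope * x + intercept)


def max_abs_error(xs, ys, line):
    return max(abs(signed_error(line, x, y)) for x, y in zip(xs, ys))


xs = [0.125 * k / 200 for k in range(201)]
ys = [math.sin(math.pi * x) for x in xs]
ls_line = least_squares_line(xs, ys)
mm_line = minimax_line(xs, ys)
print(max_abs_error(xs, ys, ls_line), max_abs_error(xs, ys, mm_line))
print(signed_error(mm_line, xs[0], ys[0]), signed_error(mm_line, xs[-1], ys[-1]))
```

With the maximum-error criterion the errors at the two endpoints of the segment are equal, whereas the least squares line generally leaves unequal endpoint errors and a larger maximum error.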


8. Rounding Vice Truncation 


As discussed in Chapter V, the current circuit does not use any rounding
algorithms. Not using rounding is a potential contributor to the overall error of the
circuit. Future circuits should use a rounding algorithm.




APPENDIX A. MATLAB ALGORITHMS 


The following MATLAB Code generates the segmentation for any given function. 


A. LINEAR APPROXIMATION USING POLYFIT 


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% This program produces VHDL code for the SRC-6 that realizes a numeric
% function generator (NFG). The user specifies a function to realize, a
% domain over which the function is realized, a desired error, and a type
% of segmentation (uniform or non-uniform). The program computes a piecewise
% linear approximation using the specified segmentation type that is
% accurate to the specified error of the specified function in the
% specified interval.
%
% Created: March 16, 2005 (adapted from Arbitrary_Slope_Piecewise_Linear.m
%          written by Jon Butler)
% Last modified: January 16, 2007
% Produced by: Tom Mack
%
% This program applies an algorithm that produces
%   1. Uniform piecewise linear approximation
%   2. Non-uniform piecewise linear approximation
% For a description of the algorithm see C. L. Frenzen, T. Sasao, and J. T.
% Butler, "The tradeoff between memory size and approximation error
% in numeric function generators based on table lookup," preprint.
%
% For non-uniform segmentation, it uses MATLAB's polyfit algorithm.
% For uniform segmentation, the program determines the minimum segment
% length needed at the point of greatest curvature.
%
% Inputs
%   1. N       - number of elements on which function is expressed
%   2. f(x)    - function to be evaluated
%   3. x_low   - low end of interval over which f(x) is evaluated
%   4. x_high  - high end of interval over which f(x) is evaluated
%   5. epsilon - precision of approximation (for non-uniform only)
%   6. consegs - number of segments to use to approximate (for constant only)
% Outputs
%   1. Segment Info - Segment No, Beginning Pt, End Pt, Slope,
%      Intercept, and Error
%   2. Plot showing the approximation
%   3. VHDL code for the SRC Computer
%%%%%%%%%%%%%%%%%% INPUT OF USER-SPECIFIED PARAMETERS %%%%%%%%%%%%%%%%%%
clear
close all
format long
fprintf('\n')
fprintf('\n**********************************************************************')
fprintf('\n')
fprintf('\n LINEAR APPROXIMATION OF A FUNCTION USING POLYFIT with INTERCEPT SHIFTING')
fprintf('\n [DEFAULT in BRACKETS] ')
fprintf('\n\n')
func = input('Input the Function, func[sin(pi*x)]: ', 's');
if isempty(func)
    func = 'sin(pi*x)'; % Default
end
x_low = input('Input the Range of x - LOW value, x(low) [0]: ');
if isempty(x_low)
    x_low = 0; % Default
end
x_high = input('Input the Range of x - HIGH value, x(high) [0.5]: ');
if isempty(x_high)
    x_high = 0.5; % Default
end
vari_or_const = 0;
while vari_or_const ~= 1 & vari_or_const ~= 2 & vari_or_const ~= 3 % Check for erroneous input
    vari_or_const = input('(1)Non-uniform or (2)Uniform Segmentation or (3)Both [1]: ');
    if isempty(vari_or_const)
        vari_or_const = 1; % Default
    end
end
if vari_or_const ~= 2
    epsilon = input('Input the Desired Error, epsilon[2^-9]: ');
    if isempty(epsilon)
        epsilon = 2^-9;
    end
end
if vari_or_const == 2
    err_or_segs = input('Would you like to constrain (1)Number of Segments or (2)Error [1]: ');
    if isempty(err_or_segs)
        err_or_segs = 1;
    end
    if err_or_segs == 1
        consegs = input('Input the number of Desired Segments[16]: ');
        if isempty(consegs)
            consegs = 16;
        end
    end
    if err_or_segs == 2
        epsilon = input('Input the Desired Error, epsilon[2^-9]: ');
        if isempty(epsilon)
            epsilon = 2^-9;
        end
    end
end
N = input('Input the no. of pts the fct is to be evaluated (per unit), N[10000]: ');
if isempty(N)
    N = 10000;
end
eqn = input('Input the equation to use: (1)F(x)=mx+b or (2)F(x)=m(x-p)+b, [1]: ');
if isempty(eqn)
    eqn = 1;
end
N = N * (x_high - x_low);
x = linspace(x_low, x_high, N);
% Some sample functions
% func = '-x.*log2(x) - (1-x).*log2(1-x)';
% func = '(1/sqrt(2*pi))*exp(-sqrt(x.^2)/2)';
% func = 'sin(x)';
% func = '2.^x';
% func = '1./((x - 0.3).^2 + 0.01) + 1./((x - 0.9).^2 + 0.04) - 6;';
% func = 'humps(x)';
% func = 'sin(x)./x';

% func = 'sin(1./x)'; x = 0.2:0.001:1.0;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% NOTES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% The segments in this program do NOT overlap (i.e. the first element of
% the NEXT segment is NOT the last element of the PREVIOUS segment).
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
eval(['F = ', func, ';'])
% Print demarcation line
fprintf('\n**********************************************************************')
fprintf('\n')
%%%%%%%%%%%%%%%%%%%%%%% Segmentation Algorithm %%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%% REPEAT FOR EACH %%%%%%%%%%%%%%%%%%%%%%%%%%%%
repeat = 1;
while repeat == 1
if (mod(vari_or_const,2) == 1)
    [endpt, seg_end_point, c_1, c_0] = multiplelinapprox(x, F, epsilon);
end
if (vari_or_const == 2) & (err_or_segs == 1)
    [endpt, seg_end_point, c_1, c_0] = constantlinapprox(x, F, consegs);
end
if ((vari_or_const == 2) & (err_or_segs == 2)) | (vari_or_const == 4)
    [endpt, seg_end_point, c_1, c_0] = constlinappxwerr(x, F, epsilon);
end
%%%%%%%%%% Compute and plot function, approximate function and error
ind = 1;
for i = 1:length(seg_end_point);
    m = 1;
    XP = [];
    FP = [];
    FNC = [];
    Error = [];
    while (ind <= seg_end_point(i))
        XP(m) = x(ind);
        FNC(m) = F(ind);
        FP(m) = c_1(i)*x(ind) + c_0(i); % FP is fct piecewise
        Error(m) = FNC(m) - FP(m);
        ind = ind + 1;
        m = m + 1;
    end % while
    MaxError(i) = max(abs(Error));
    if (mod(i,2) == 0) % Plot every other segment a different color
        figure(mod(vari_or_const,2)+1)
        plot(XP,FP)
        figure(mod(vari_or_const,2)+3)
        plot(XP,Error)
    else
        figure(mod(vari_or_const,2)+1)
        plot(XP,FP,'r','LineWidth',2)
        figure(mod(vari_or_const,2)+3)
        plot(XP,Error,'r','LineWidth',2)
    end % if (mod(i,2) == 0)
    figure(mod(vari_or_const,2)+1)
    hold on
    xlabel('x','FontSize',10)
    ylabel('f(x)','FontSize',10)
    if (mod(vari_or_const,2) == 1)
        title(['NON-UNIFORM f(x) segmentation. No. of segments = ',num2str(length(seg_end_point)),'.'],'FontSize',10)
    elseif (mod(vari_or_const,2) == 0)
        title(['UNIFORM f(x) segmentation. No. of segments = ',num2str(length(seg_end_point)),'.'],'FontSize',10)
    end
    figure(mod(vari_or_const,2)+3)
    hold on
    xlabel('x','FontSize',14)
    ylabel(['Error(x). Max Error = ',num2str(max(MaxError)),'.'],'FontSize',10)
    if (mod(vari_or_const,2) == 1)
        title(['Error for NON-UNIFORM f(x) segmentation. No. of segs = ',num2str(length(seg_end_point)),'.'],'FontSize',10)
    elseif (mod(vari_or_const,2) == 0)
        title(['Error for UNIFORM f(x) segmentation. No. of segs = ',num2str(length(seg_end_point)),'.'],'FontSize',10)
    end
end % for i = 1:length(seg_end_point)
figure(mod(vari_or_const,2)+1)
plot(x,F) % Plot function on same figure as piecewise approximation
stem(x(seg_end_point),F(seg_end_point))
hold off

%%%%%%%%%%%%%%%%%%% Decimal to Binary Conversion Algorithm
% Convert string end points, c_1 and c_0 into a binary string with
% 8 fraction bits and print results table
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
if (mod(vari_or_const,2) == 1)
    fprintf('\n NON-UNIFORM Segmentation')
elseif (mod(vari_or_const,2) == 0)
    fprintf('\n UNIFORM Segmentation')
end
if eqn == 1
    fprintf('\n Segment  End Point  End Point  c_1        c_1       c_0        c_0')
    fprintf('\n Number   (Decimal)  (Binary)   (Decimal)  (Binary)  (Decimal)  (Binary)')
end
for i = 1:length(seg_end_point)
    xbin(i) = dec2binfp(x(seg_end_point(i)));
    segment(i+1) = x(seg_end_point(i)); % Used in next program
    c_1bin(i) = dec2binfp(c_1(i));
    c_0bin(i) = dec2binfp(c_0(i));
    if eqn == 1
        % Print Remaining Results Table
        fprintf('\n %3d %3.6f %017.9f %10.5f %017.9f %10.5f %017.9f', i-1, x(seg_end_point(i)), xbin(i), c_1(i), c_1bin(i), c_0(i), c_0bin(i))
    end % if eqn == 1
end % for i = 1:length(seg_end_point)


% Create text file to initialize memory
% mem = [c_1bin .* 10^9; c_0bin .* 10^9];
% fid = fopen('memory.mem', 'w');
% fprintf(fid, '%016.0f%016.0f\n', mem);
% fclose(fid);
% End text file creation


% Create VHDL file for use by the SRC-6.
fid = fopen('slopeintlu.vhd','w');
%%%%%%%%%%%%%%%%%%%%%%%%%%%% VHDL Code %%%%%%%%%%%%%%%%%%%%%%%%%%%%
fprintf(fid,'%s\n','library IEEE;');
fprintf(fid,'%s\n','use IEEE.STD_LOGIC_1164.ALL;');
fprintf(fid,'%s\n','use IEEE.STD_LOGIC_ARITH.ALL;');
fprintf(fid,'%s\n','use IEEE.STD_LOGIC_UNSIGNED.ALL;');
fprintf(fid,'%s\n','--- GENERATED BY MATLAB ROUTINE LinAppxPfit.m ---');
fprintf(fid,'%s\n','--- Written by Tom Mack, 5/10/2006. Modified 1/16/07.');
fprintf(fid,'%s\n','--- Segment Encoder outputs the corresponding slope and intercept for the segment based');
fprintf(fid,'%s\n','-- upon segment endpoints.');
fprintf(fid,'%s\n','--- Segendpt and slope and intercept determined by MATLAB program LinAppxPfit.m');
fprintf(fid,'%s\n','---');
fprintf(fid,'%s\n','library UNISIM;');
fprintf(fid,'%s\n','use UNISIM.VComponents.all;');
fprintf(fid,'%s\n','entity slopeintlu is');
fprintf(fid,'%s%d%s\n',' generic (x_bits:integer:=16; s_bits:integer:=16; i_bits:integer:=16; segs:integer:=',length(seg_end_point),');');
fprintf(fid,'%s\n',' Port ( x : in std_logic_vector(x_bits-1 downto 0);');
fprintf(fid,'%s\n','   slope : out std_logic_vector(s_bits-1 downto 0);');
fprintf(fid,'%s\n','   intercept : out std_logic_vector(i_bits-1 downto 0));');
fprintf(fid,'%s\n',' type ENDPT is array(0 to segs-1) of std_logic_vector(x_bits-1 downto 0);');
fprintf(fid,'%s\n',' end slopeintlu;');
fprintf(fid,'%s\n','');
fprintf(fid,'%s\n',' architecture Beh of slopeintlu is');
fprintf(fid,'%s\n',' begin');
fprintf(fid,'%s\n',' process (x)');
fprintf(fid,'%s\n',' variable SEGENDPT:ENDPT;');
fprintf(fid,'%s\n',' begin');
for i = 1:length(seg_end_point);
    fprintf(fid,'%s%d%s%016.0f%s\n',' SEGENDPT(',i-1,') := "',xbin(i)*10^9,'";');
end
fprintf(fid,'%s\n','');
fprintf(fid,'%s\n',' if x < SEGENDPT(0) then');
fprintf(fid,'%s%016.0f%s\n','   slope <= "',c_1bin(1)*10^9,'";');
fprintf(fid,'%s%016.0f%s\n','   intercept <= "',c_0bin(1)*10^9,'";');
for i = 1:length(seg_end_point)-2;
    fprintf(fid,'%s%d%s\n',' elsif x < SEGENDPT(',i,') then');
    fprintf(fid,'%s%016.0f%s\n','   slope <= "',c_1bin(i+1)*10^9,'";');
    fprintf(fid,'%s%016.0f%s\n','   intercept <= "',c_0bin(i+1)*10^9,'";');
end
fprintf(fid,'%s%016.0f%s\n',' else slope <= "',c_1bin(length(seg_end_point))*10^9,'";');
fprintf(fid,'%s%016.0f%s\n','   intercept <= "',c_0bin(length(seg_end_point))*10^9,'";');
fprintf(fid,'%s\n',' end if;');
fprintf(fid,'%s\n',' end process;');
fprintf(fid,'%s\n',' end Beh;');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
fclose(fid);
% End text file creation


if eqn == 2
%%%%%%%%%%%%%%%%%%% The following created from: Extract_PL_Params.m
%
% This program extracts from the segmentation and the function, the
%   1. Slope
%   2. Intercept
%   3. Pivot
%
% which are the parameters needed to store in the coefficients memory. It
% produces the BINARY values of these parameters.
%
% The segmentation occurs as a vector of end points.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
fprintf('\n')
fprintf('\n**********************************************************************')
fprintf('\n')
segment(1) = 0;
for i = 1:length(segment)
    seg_index(i) = floor(N*segment(i)/(x_high-x_low))+1;
end % for i = 1:length(segment)
seg_index;
for i = 2:length(segment)
    slope(i-1) = (F(seg_index(i)-1) - F(seg_index(i-1)))/(x(seg_index(i)-1) - x(seg_index(i-1)));
    intercept(i-1) = F(seg_index(i)-1) - slope(i-1)*x(seg_index(i)-1);
    a = max(F(seg_index(i-1):seg_index(i)-1) - (slope(i-1).*x(seg_index(i-1):seg_index(i)-1)+intercept(i-1)));
    b = min(F(seg_index(i-1):seg_index(i)-1) - (slope(i-1).*x(seg_index(i-1):seg_index(i)-1)+intercept(i-1)));
    error(i-1) = 0.5*(a + b); % YES, it is a+b. One of a and b is 0, and a >= 0 and b <= 0.
    intercept(i-1) = intercept(i-1) + error(i-1) + slope(i-1)*segment(i-1);
    s_m_e(i-1) = segment(i) - segment(i-1);
    c1x(i-1) = s_m_e(i-1)*slope(i-1);
    approx(i-1) = c1x(i-1) + intercept(i-1);
    exact(i-1) = 2^segment(i); % Exact value of f(x) at the end of the segment.
end % for i = 2:length(segment)
fprintf('\nDECIMAL values for Approx = slope*(x - pivot) + intercept.')
fprintf('\nseg no. [s, e] slope intercept pivot approx_error e-s (e-s)*slope (e-s)*slope+intercept exact f(x) \n')
for i = 1:length(segment)-1
    fprintf('%1.0f [%8.6f %8.6f] %8.6f %8.6f %8.6f %8.6f %8.6f %8.6f %8.6f %8.6f \n', i-1, segment(i), segment(i+1), slope(i), intercept(i), segment(i), error(i), s_m_e(i), c1x(i), approx(i), exact(i))
end % for i = 1:length(segment)-1
% hold on
% plot(x(1:N),slope(1).*x(1:N)+intercept(1))
% Convert s, e, slope, intercept, and pivot to binary.
fprintf('\nBINARY values')
fprintf('\nseg no. [s, e] slope intercept approx_error e-s (e-s)*slope (e-s)*sl+intercept exact f(x) \n')
for i = 1:length(segment)-1
    digits = ceil(log2(length(segment)-1));
    s_seg_no = dec2bin(i-1,digits);
    s_s(i) = dec2binfp(segment(i));
    s_e(i) = dec2binfp(segment(i+1));
    s_slope(i) = dec2binfp(slope(i));
    s_intercept(i) = dec2binfp(intercept(i));
    if error(i) < 0;
        error(i) = abs(error(i));
    end % if error(i) < 0;
    s_error(i) = dec2binfp(error(i));
    s_s_m_e(i) = dec2binfp(s_m_e(i));
    s_c1x(i) = dec2binfp(c1x(i));
    s_approx(i) = dec2binfp(approx(i));
    s_exact(i) = dec2binfp(exact(i));
    fprintf('%s [%10.8f %10.8f] %10.8f %10.8f %10.8f %10.8f %10.8f %10.8f %10.8f \n', s_seg_no, s_s(i), s_e(i), s_slope(i), s_intercept(i), s_error(i), s_s_m_e(i), s_c1x(i), s_approx(i), s_exact(i))
end % for i
end % if eqn == 2
fprintf('\n')
fprintf('\n**********************************************************************')
fprintf('\n')
if vari_or_const ~= 3
    repeat = 0;
end
if vari_or_const == 3
    vari_or_const = 4;
end
end % while repeat == 1

% End file: LinAppxPfit.m


B. MULTIPLE LINE APPROXIMATION 


function [endpt,indx,c1,c0] = multiplelinapprox(x,fct,max_error)
% This function will produce multiple straight-line approximations of a
% given function to within the bounds of max error provided.
%
% Created by Tom Mack
% Created: Mar 31, 2006

i = 1; indx = 1; seg_no = 1; endpt = []; c1 = []; c0 = [];

while i < length(fct)
    [endpt(seg_no),indx(seg_no),c1(seg_no),c0(seg_no)] = varlinapprox(x,fct,max_error,i);
    i = indx(seg_no) + 1;
    seg_no = seg_no + 1;
end


C. NON-UNIFORM LINEAR APPROXIMATION 


function [endpt,i,c1,c0] = varlinapprox(x,fct,max_error,indx)
% This function creates a straight line approximation of a given function
% using the polyfit function. It continues to calculate polyfits until
% maximum error is exceeded.
%
% Created by Tom Mack
% Created: Mar 31, 2006
% Modified: Jul 10, 2006

for i=indx+1:length(fct);
    p = polyfit(x(indx:i),fct(indx:i),1);
    c_1(i) = p(1); c_0(i) = p(2);
    approx(indx:i) = p(1)*x(indx:i)+p(2);
    errors = approx(indx:i) - fct(indx:i);
    maxposerror = max(errors);
    maxnegerror = min(errors);
    % Intercept shift that balances the maximum positive and negative errors
    c_0delta(i) = abs((abs(maxposerror) - abs(maxnegerror))/2);
    if abs(maxnegerror) > abs(maxposerror)
        c_0delta(i) = -1 * c_0delta(i);
    end % if
    approx(indx:i) = approx(indx:i) - c_0delta(i);
    errors = approx(indx:i) - fct(indx:i);
    error = max(abs(errors));
    if error > max_error
        % Error bound exceeded: fall back to the last point that still fit
        endpt = x(i-1);
        c1 = c_1(i-1);
        c0 = c_0(i-1) - c_0delta(i-1);
        i = i-1;
        return
    end % if error > max
    endpt = x(i);
    i = i-1;
    c1 = c_1(i);
    c0 = c_0(i) - c_0delta(i);
end % for i=indx+1:length(fct)
D. UNIFORM LINEAR APPROXIMATION


function [endpt,indx,c1,c0] = constantlinapprox(x,fct,consegs)
% This function will produce multiple straight-line approximations of a
% given function to within the bounds of the number of segments provided.
% Slope and intercept calculated by polyfit. Intercept adjusted to
% balance maximum positive and negative errors.
%
% Created by Tom Mack
% Created: June 4, 2006
% Modified: July 11, 2006

idx=1;
for i = 1:consegs
    indx(i) = round((length(x)/consegs)*i);
    if i==consegs
        indx(i) = length(x);
    end
    endpt(i) = x(indx(i));
    p = polyfit(x(idx:indx(i)),fct(idx:indx(i)),1);
    approx(idx:indx(i)) = p(1)*x(idx:indx(i))+p(2);
    errors = approx(idx:indx(i)) - fct(idx:indx(i));
    maxposerror = max(errors);
    maxnegerror = min(errors);
    c_0delta = abs((abs(maxposerror) - abs(maxnegerror))/2);
    if abs(maxnegerror) > abs(maxposerror)
        c_0delta = -1 * c_0delta;
    end % if
    c1(i) = p(1); c0(i) = p(2) - c_0delta; % Intercept shift to balance pos & neg error
    idx = indx(i)+1;
    i = i+1;
end


E. UNIFORM LINEAR APPROXIMATION WITH ERROR BOUNDS 


function [endpt,indx,c1,c0] = constlinappxwerr(x,fct,max_error)
% This function will produce multiple straight-line approximations of a
% constant size of a given function to within the bounds of the
% max error provided. Slope and intercept calculated using polyfit.
% Intercept adjusted to balance max positive and negative errors.
%
% Created by Tom Mack
% Created: July 10, 2006
% Modified: July 11, 2006

% Compute # of segs
firstderiv = diff(fct)./diff(x);
secndderiv = diff(firstderiv)./diff(x(1:length(firstderiv)));
[dermax,i] = max(abs(secndderiv));
error = 0;
loop_stop = 0;
i_low = i - 1;
if i_low <= 0
    i_low = 1;
end
i_high = i + 1;
if i_high > length(fct)
    i_high = length(fct);
end
while error < max_error || loop_stop < length(fct)
    i_low = i_low - 1;
    if i_low <= 0
        i_low = 1;
    end
    i_high = i_high + 1;
    if i_high > length(fct)
        i_high = length(fct);
    end
    p = polyfit(x(i_low:i_high),fct(i_low:i_high),1);
    approx(i_low:i_high) = p(1)*x(i_low:i_high)+p(2);
    errors = approx(i_low:i_high) - fct(i_low:i_high);
    maxposerror = max(errors);
    maxnegerror = min(errors);
    c_0delta = abs((abs(maxposerror) - abs(maxnegerror))/2);
    if abs(maxnegerror) > abs(maxposerror)
        c_0delta = -1 * c_0delta;
    end % if
    approx(i_low:i_high) = approx(i_low:i_high) - c_0delta;
    errors = approx(i_low:i_high) - fct(i_low:i_high);
    error = max(abs(errors));
    if error > max_error
        i_low = i_low + 1;
        i_high = i_high - 1;
    end
    loop_stop = loop_stop + 1;
end
segsize = i_high - i_low;
consegs = ceil(length(fct)/segsize);

% Determine slope & intercept of segments
idx=1;
for i = 1:consegs
    indx(i) = round((length(x)/consegs)*i);
    if indx(i) == 0
        indx(i) = 1;
    end
    if i==consegs
        indx(i) = length(x);
    end
    endpt(i) = x(indx(i));
    p = polyfit(x(idx:indx(i)),fct(idx:indx(i)),1);
    approx(idx:indx(i)) = p(1)*x(idx:indx(i))+p(2);
    errors = approx(idx:indx(i)) - fct(idx:indx(i));
    maxposerror = max(errors);
    maxnegerror = min(errors);
    c_0delta = abs(abs(maxposerror) - abs(maxnegerror))/2;
    if abs(maxnegerror) > abs(maxposerror)
        c_0delta = -1 * c_0delta;
    end % if
    % Intercept shift to balance pos & neg error
    c1(i) = p(1); c0(i) = p(2) - c_0delta; idx = indx(i)+1;
    i = i+1;
end


F. FIXED-POINT DECIMAL TO BINARY 


function [binfp] = dec2binfp(x,n)
% Function converts a decimal number to a fixed point binary number
% with one integer followed by n points to the right of the decimal
%
% Created by Tom Mack
% Last modified: August 22, 2006
%
% Inputs
%   x = decimal number to be converted (does not have to be an integer)
%   n (optional, default 9) = bits resolution to the right and left of
%       decimal point
% Outputs
%   binfp = binary floating point representation
%   Negative inputs are output in 16-bit (7.9) format

if nargin < 2, n = 9; end
if isnan(x) == 1,
    binfp = NaN;
    return
elseif x == Inf
    binfp = Inf;
    return
elseif x < 0,
    x = (x * 2^n) + 2^(2*(n-1));
    x = dec2bin(x);
    x = str2num(x);
    x = x / 10^n;
    binfp = x;
    return
else
    x = x * 2^n;
    x = dec2bin(x,18);
    x = str2num(x);
    x = x / 10^n;
    binfp = x;
end



APPENDIX B. VHDL SOFTWARE CODE 


A. NFG TOP-LEVEL VHDL CODE 
-- Copyright (c) 1995-2003 Xilinx, Inc. 
-- All Right Reserved. 


--
-- Vendor: Xilinx
-- Version : 6.3.031
-- Application :
-- Filename : signedfct.vhf
-- Timestamp : 01/31/2007 15:57:55
--

--Command: 

--Design Name: FD16CE_MXILINX_signedfct 


library ieee; 

use ieee.std_logic_1164.ALL;
use ieee.numeric_std.ALL;

-- synopsys translate_off 

library UNISIM; 

use UNISIM.Vcomponents.ALL; 
-- synopsys translate_on 


entity FD16CE_MXILINX_signedfct is
port ( C : in std_logic;
CE : in std_logic;
CLR : in std_logic;
D : in std_logic_vector (15 downto 0);
Q : out std_logic_vector (15 downto 0));
end FD16CE_MXILINX_signedfct;


architecture BEHAVIORAL of FD16CE_MXILINX_signedfct is
attribute INIT : string ;
attribute BOX_TYPE : string ;
component FDCE
-- synopsys translate_off
generic( INIT : bit := '0');
-- synopsys translate_on
port ( C : in std_logic;
CE : in std_logic;
CLR : in std_logic;
D : in std_logic;
Q : out std_logic);
end component;
attribute INIT of FDCE : component is "0";
attribute BOX_TYPE of FDCE : component is "BLACK_BOX";


begin 
I_Q0 : FDCE
port map (C=>C, 
CE=>CE, 
CLR=>CLR, 
D=>D(0), 
Q=>Q(0)); 


I_Q1 : FDCE 
port map (C=>C, 
CE=>CE, 
CLR=>CLR, 
D=>D(1), 
Q=>Q(1));


I_Q2: FDCE 
port map (C=>C, 
CE=>CE, 
CLR=>CLR, 
D=>D(2), 
Q=>Q(2));


I_Q3 : FDCE 
port map (C=>C, 
CE=>CE, 
CLR=>CLR, 
D=>D(3), 
Q=>Q(3)); 


I_Q4: FDCE 
port map (C=>C, 
CE=>CE, 
CLR=>CLR, 
D=>D(4), 
Q=>Q(4));


I_Q5: FDCE 
port map (C=>C, 




CE=>CE, 
CLR=>CLR, 
D=>D(5), 
Q=>Q(5)); 


I_Q6: FDCE 
port map (C=>C, 
CE=>CE, 
CLR=>CLR, 
D=>D(6), 
Q=>Q(6));


I_Q7: FDCE 
port map (C=>C, 
CE=>CE, 
CLR=>CLR, 
D=>D(7), 
Q=>Q(7));


I_Q8 : FDCE 
port map (C=>C, 
CE=>CE, 
CLR=>CLR, 
D=>D(8), 
Q=>Q(8)); 


I_Q9 : FDCE 
port map (C=>C, 
CE=>CE, 
CLR=>CLR, 
D=>D(9), 
Q=>Q(9));


I_Q10: FDCE 
port map (C=>C, 
CE=>CE, 
CLR=>CLR, 
D=>D(10), 
Q=>Q(10));


I_Q11 : FDCE
port map (C=>C, 
CE=>CE, 
CLR=>CLR, 
D=>D(11), 
Q=>Q(11));


I_Q12 : FDCE
port map (C=>C, 
CE=>CE, 
CLR=>CLR, 
D=>D(12), 
Q=>Q(12));


I_Q13 : FDCE 
port map (C=>C, 
CE=>CE, 
CLR=>CLR, 
D=>D(13), 
Q=>Q(13)); 


I_Q14: FDCE 
port map (C=>C, 
CE=>CE, 
CLR=>CLR, 
D=>D(14), 
Q=>Q(14)); 


I_Q15 : FDCE 
port map (C=>C, 
CE=>CE, 
CLR=>CLR, 
D=>D(15), 
Q=>Q(15));


end BEHAVIORAL; 


-- Copyright (c) 1995-2003 Xilinx, Inc. 
-- All Right Reserved. 


--
-- Vendor: Xilinx
-- Version : 6.3.031
-- Application :
-- Filename : signedfct.vhf
-- Timestamp : 01/31/2007 15:57:55
--
--Command:
--Design Name: signedfct 


library ieee; 

use ieee.std_logic_1164.ALL;
use ieee.numeric_std.ALL;

-- synopsys translate_off

library UNISIM;

use UNISIM.Vcomponents.ALL;
-- synopsys translate_on


entity signedfct is
port ( CLK : in std_logic;
X : in std_logic_vector (14 downto 0);
FX : out std_logic_vector (14 downto 0));
end signedfct;


architecture BEHAVIORAL of signedfct is

attribute BOX_TYPE : string ;
attribute HU_SET : string ;
signal CE : std_logic;
signal CLR : std_logic;
signal GND1 : std_logic;
signal INT1 : std_logic_vector (15 downto 0);
signal Prod : std_logic_vector (15 downto 0);
signal XLXN_65 : std_logic_vector (15 downto 0);
signal XLXN_66 : std_logic_vector (15 downto 0);
signal XLXN_73 : std_logic_vector (15 downto 0);
signal XLXN_86 : std_logic_vector (15 downto 0);
signal XLXN_134 : std_logic_vector (31 downto 16);
signal XLXN_135 : std_logic_vector (15 downto 0);
component VCC
port ( P : out std_logic);
end component;
attribute BOX_TYPE of VCC : component is "BLACK_BOX";


component GND
port ( G : out std_logic);
end component;
attribute BOX_TYPE of GND : component is "BLACK_BOX";


component FD16CE_MXILINX_signedfct
port ( C : in std_logic;
CE : in std_logic;
CLR : in std_logic;
D : in std_logic_vector (15 downto 0);
Q : out std_logic_vector (15 downto 0));
end component;


component mult16x16s
port ( M : in std_logic_vector (15 downto 0);
X : in std_logic_vector (15 downto 0);
MX : out std_logic_vector (15 downto 0));
end component;


component adder16x16s
port ( X : in std_logic_vector (15 downto 0);
Y : in std_logic_vector (15 downto 0);
S : out std_logic_vector (14 downto 0));
end component;


component slopeintlu
port ( x : in std_logic_vector (15 downto 0);
slope : out std_logic_vector (15 downto 0);
intercept : out std_logic_vector (15 downto 0));
end component;


attribute HU_SET of XLXI_43 : label is "XLXI_43_4";
attribute HU_SET of XLXI_44 : label is "XLXI_44_0";
attribute HU_SET of XLXI_46 : label is "XLXI_46_1";
attribute HU_SET of XLXI_47 : label is "XLXI_47_2";
attribute HU_SET of XLXI_48 : label is "XLXI_48_3";
begin 

XLXI_41 : VCC

port map (P=>CE); 


XLXI_42 : GND 
port map (G=>CLR); 


XLXI_43 : FD16CE_MXILINX_signedfct
port map (C=>CLK, 
CE=>CE, 
CLR=>CLR, 
D(15 downto 1)=>X(14 downto 0), 
D(0)=>GND1,
Q(15 downto 0)=>XLXN_73(15 downto 0)); 


XLXI_44 : FD16CE_MXILINX_signedfct 
port map (C=>CLK, 
CE=>CE, 
CLR=>CLR,
D(15 downto 0)=>XLXN_66(15 downto 0),
Q(15 downto 0)=>Prod(15 downto 0)); 


XLXI_46 : FD16CE_MXILINX_signedfct 
port map (C=>CLK, 
CE=>CE, 
CLR=>CLR, 
D(15 downto 0)=>XLXN_135(15 downto 0), 
Q(15 downto 0)=>XLXN_65(15 downto 0)); 


XLXI_47 : FD16CE_MXILINX_signedfct
port map (C=>CLK, 
CE=>CE, 
CLR=>CLR, 
D(15 downto 0)=>XLXN_86(15 downto 0), 
Q(15 downto 0)=>INT1(15 downto 0)); 


XLXI_48 : FD16CE_MXILINX_signedfct 
port map (C=>CLK, 
CE=>CE, 
CLR=>CLR, 
D(15 downto 0)=>XLXN_134(31 downto 16), 
Q(15 downto 0)=>XLXN_86(15 downto 0)); 


XLXI_49 : mult16x16s 
port map (M(15 downto 0)=>XLXN_65(15 downto 0), 
X(15 downto 0)=>XLXN_73(15 downto 0), 
MX(15 downto 0)=>XLXN_66(15 downto 0)); 


XLXI_50 : adder16x16s
port map (X(15 downto 0)=>Prod(15 downto 0), 
Y(15 downto 0)=>INT1(15 downto 0), 
S(14 downto 0)=>FX(14 downto 0)); 


XLXI_51:GND 
port map (G=>GND1); 


XLXI_55 : slopeintlu
port map (x(15 downto 1)=>X(14 downto 0), 
x(0)=>GND1, 
intercept(15 downto 0)=>XLXN_134(31 downto 16), 
slope(15 downto 0)=>XLXN_135(15 downto 0)); 


end BEHAVIORAL; 




B. SLOPE AND INTERCEPT LOOK-UP CODE 


This is the ‘if, then, else’ code used for sin(πx), where 0 < x < 0.5. It results in 7
non-uniform segments. Numbers are written in a 7.9 fixed-point binary format (7 digits
to the left and 9 digits to the right of the binary point).
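
As a rough illustration of the 7.9 format, the following C sketch converts between real values and 16-bit 7.9 words and decodes binary literals such as those in the look-up code. The helper names (to_q7_9, from_q7_9, parse_bits) are hypothetical and are not part of the thesis code; the sketch assumes non-negative values and truncation toward zero.

```c
#include <stdint.h>
#include <stdlib.h>

#define FRAC_BITS 9  /* 7.9 format: 9 bits to the right of the binary point */

/* Convert a real value to a 7.9 fixed-point word, truncating toward zero. */
static int16_t to_q7_9(double v) {
    return (int16_t)(v * (1 << FRAC_BITS));
}

/* Convert a 7.9 fixed-point word back to a real value. */
static double from_q7_9(int16_t w) {
    return (double)w / (1 << FRAC_BITS);
}

/* Parse a binary literal (non-negative values only in this sketch). */
static int16_t parse_bits(const char *bits) {
    return (int16_t)strtol(bits, NULL, 2);
}
```

For example, from_q7_9(parse_bits("0000011000100101")) decodes the first-segment slope literal to 1573/512, about 3.0723, and from_q7_9(parse_bits("0000000000111110")) decodes the first segment endpoint to 62/512, about 0.1211.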











library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
--- GENERATED BY MATLAB ROUTINE LinAppxPfit.m ---
--- Written by Tom Mack, 5/10/2006. Modified 1/16/07.
--- Segment Encoder outputs the corresponding slope and intercept for
--- the segment based upon segment endpoints.
--- Segendpt and slope and intercept initialized based upon MATLAB 7
--- segment fct approximation.
library UNISIM;
use UNISIM.VComponents.all;
entity slopeintlu is
generic(x_bits:integer:=16; s_bits:integer:=16;
i_bits:integer:=16; segs:integer:=7);
Port ( x : in std_logic_vector(x_bits-1 downto 0);
slope : out std_logic_vector(s_bits-1 downto 0);
intercept : out std_logic_vector(i_bits-1 downto 0));
type ENDPT is array(0 to segs-1) of std_logic_vector(x_bits-1
downto 0);
end slopeintlu;


architecture Beh of slopeintlu is 
begin 
process (x)
variable SEGENDPT:ENDPT;
begin 
SEGENDPT(0) := "0000000000111110"; 
SEGENDPT(1) := "0000000001100110"; 
SEGENDPT (2) := "0000000010001001"; 
SEGENDPT (3) := "0000000010101001"; 
SEGENDPT (4) := "0000000011000111"; 
SEGENDPT(5) := "0000000011100101"; 
SEGENDPT(6) := "0000000011111111"; 
if x < SEGENDPT(0) then 
slope <= "0000011000100101"; 
intercept <= "0000000000000000"; 
elsif x < SEGENDPT(1) then 
slope <= "0000010101111100"; 
intercept <= "0000000000010100"; 
elsif x < SEGENDPT(2) then 
slope <= "0000010010100100"; 
intercept <= "0000000001000000"; 
elsif x < SEGENDPT(3) then
slope <= "0000001110101111";
intercept <= "0000000010000010";
elsif x < SEGENDPT(4) then
slope <= "0000001010101000";
intercept <= "0000000011011001";
elsif x < SEGENDPT(5) then
slope <= "0000000110010100";
intercept <= "0000000101000100";
else
slope <= "0000000010000100";
intercept <= "0000000110111110";
end if;

end process; 

end Beh; 



































C. MULTIPLIER CODE 


library IEEE; 

use IEEE.STD_LOGIC_1164.ALL; 

use IEEE.STD_LOGIC_ARITH.ALL; 
use IEEE.STD_LOGIC_SIGNED.ALL; 


-- 16x16 Signed Multiplier 

-- Created by: Tom Mack, 8/22/06. Modified: 8/26/06. 

-- Intended to multiply a 16-bit (7.9) number (Slope) with a 15-bit (7.9) number (X) 
-- Output is a 16-bit (8.8) number. 


entity mult16x16s is 
Port (M: in std_logic_vector(15 downto 0); 
X : in std_logic_vector(15 downto 0); 
MX : out std_logic_vector(15 downto 0));
end mult16x16s; 


architecture Beh of mult16x16s is 
signal MX32bit : std_logic_vector(31 downto 0); 


begin 
MX32bit(31 downto 0) <= M(15 downto 0) * X(15 downto 0); 
MX(15 downto 0) <= MX32bit(24 downto 9); 

end Beh; 
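

The bit slice in the multiplier can be mirrored in C. The sketch below is only an illustration under the assumption that both operands carry 9 fractional bits; the function name mul_q7_9 is hypothetical. The full product then has 18 fractional bits, and keeping bits 24 downto 9 (a right shift by 9) leaves 9 fractional bits in a 16-bit result.

```c
#include <stdint.h>

#define FRAC_BITS 9  /* assumed number of fractional bits in each operand */

/* Multiply two fixed-point values and keep bits 24..9 of the 32-bit product.
 * Assumes an arithmetic right shift for negative products, which is what
 * common compilers provide but is not guaranteed by the C standard. */
static int16_t mul_q7_9(int16_t m, int16_t x) {
    int32_t product = (int32_t)m * (int32_t)x;  /* 18 fractional bits */
    return (int16_t)(product >> FRAC_BITS);     /* back to 9 fractional bits */
}
```

With these assumptions, mul_q7_9(1536, 256) multiplies 3.0 by 0.5 and returns 768, which is 1.5 in the same format.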


D. ADDER CODE 


library IEEE; 
use IEEE.STD_LOGIC_1164.ALL; 
use IEEE.STD_LOGIC_ARITH.ALL; 




use IEEE.STD_LOGIC_SIGNED.ALL; 


--library UNISIM; 
--use UNISIM. VComponents.all; 


entity adder16x16s is 
generic (in_bits:integer:=16; out_bits:integer:=15); 
Port ( X : in std_logic_vector(in_bits-1 downto 0); 
Y : in std_logic_vector(in_bits-1 downto 0); 
S : out std_logic_vector(out_bits-1 downto 0));
end adder16x16s; 


architecture Beh of adder16x16s is 
signal S16bit : std_logic_vector(15 downto 0); 
begin 


S16bit <= X + Y;
S <= S16bit(15 downto 1); 


end Beh;


APPENDIX C. SRC COMPUTER VHDL MACRO IMPLEMENTATION CODE³


³ All SRC code adapted from code developed by SRC Computers, Inc. as provided in the SRC Carte™ Training Exercises, Release 2.1 [19].


A. C IMPLEMENTATION CODE USING MATH LIBRARY


1. Main.c


static char const cvsid[] = "$Id: main.c,v 2.1 2005/06/14 22:16:46 jls Exp $";

#include <libmap.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

#define SZ 65536
void subr (float*, float*, int, int64_t*, int);

int main (int argc, char *argv[]) {
    int i, num;
    float *A, *D, HIGH, LOW, *F;
    int64_t tm;
    int mapnum = 0;

    if (argc < 2) {
        fprintf (stderr, "need number of elements (up to %d) as arg\n", SZ);
        exit (1);
    }

    if (sscanf (argv[1], "%d", &num) < 1) {
        fprintf (stderr, "need number of elements (up to %d) as arg\n", SZ);
        exit (1);
    }

    if (num > SZ) {
        fprintf (stderr, "need number of elements (up to %d) as arg\n", SZ);
        exit (1);
    }

    A = (float*) malloc (SZ * sizeof (float));
    D = (float*) malloc (SZ * sizeof (float));
    F = (float*) malloc (SZ * sizeof (float));

    srandom (99);

    HIGH = 0.5;
    LOW = 0.0;
    for (i=0; i<num; i++) {
        A[i] = rand() % 256 * (HIGH - LOW) / 256 + LOW;
    }

    map_allocate (1);

    // call the MAP routine
    subr (A, D, num, &tm, mapnum);

    printf ("%lld clocks\n", tm);

    time_t start;
    time_t stop;

    // time the same calculation with the math library on the host
    time( &start );
    for (i=0; i<num; i++) {
        F[i] = sin(A[i]*3.14159);
    }
    time ( &stop );
    printf( "\t\tTime = %e nano-seconds\n", difftime ( stop, start ) * 1000000000);

    for (i=0; i<num; i++) {
        printf ("Iteration %d. Sine of (%f * pi) equals %f MAP-C\n", i, A[i], D[i]);
        printf (" Sine of (%f * pi) equals %f C\n", A[i], F[i]);
    }
    map_free (1);
    exit (0);
}
2. Sin.mc


#include <libmap.h>

#define SZ 65536

void subr (float A[], float D[], int num, int64_t *time, int mapnum) {
    OBM_BANK_A (AL, int64_t, SZ)
    OBM_BANK_D (DL, int64_t, SZ)
    int64_t t0, t1;
    float v0, v1, res0, res1;
    int i;

    DMA_CPU (CM2OBM, AL, MAP_OBM_stripe(1,"A"), A, 1, SZ*sizeof (float), 0);
    wait_DMA (0);

    read_timer (&t0);

    // each 64-bit OBM word holds two 32-bit floats
    for (i=0; i<num/2; i++) {
        split_64to32_flt_flt(AL[i], &v0, &v1);
        res0 = sinf (v0*3.14159);
        res1 = sinf(v1*3.14159);
        comb_32to64_flt_flt (res0,res1,&DL[i]);
    }

    read_timer (&t1);

    *time = t1 - t0;
    DMA_CPU (OBM2CM, DL, MAP_OBM_stripe(1,"D"), D, 1, SZ*sizeof(float), 0);

    wait_DMA (0);
}


B. C IMPLEMENTATION CODE USING IF, THEN, ELSE 


1. Floating Point 


a. Main.c 


#include <libmap.h>
#include <stdlib.h>
#include <math.h>

#define SZ 65536
void subr (double*, double*, int, int64_t*, int);
int main (int argc, char *argv[]) {
    int i, num;
    double *A, *D, HIGH, LOW;
    int64_t tm;
    int mapnum = 0;

    if (argc < 2) {
        fprintf (stderr, "need number of elements (up to %d) as arg\n", SZ);
        exit (1);
    }

    if (sscanf (argv[1], "%d", &num) < 1) {
        fprintf (stderr, "need number of elements (up to %d) as arg\n", SZ);
        exit (1);
    }

    if (num > SZ) {
        fprintf (stderr, "need number of elements (up to %d) as arg\n", SZ);
        exit (1);
    }

    A = (double*) malloc (SZ * sizeof (double));
    D = (double*) malloc (SZ * sizeof (double));

    srandom (99);

    HIGH = 0.5;
    LOW = 0.0;
    for (i=0; i<num; i++) {
        A[i] = rand() % 256 * (HIGH - LOW) / 256 + LOW;
        if (A[i] < 0) A[i] = -A[i];
    }

    map_allocate (1);

    // call the MAP routine
    subr (A, D, num, &tm, mapnum);

    // print results
    printf ("%lld clocks\n", tm);

    for (i=0; i<num; i++) {
        printf ("Iteration %d. Sine of (%f * pi) equals %f ITE\n", i, A[i], D[i]);
        printf (" Sine of (%f * pi) equals %f C\n", A[i], sin(A[i]*3.14159));
    }
    map_free (1);

    exit (0);
}


b. Sin.mc 


#include <libmap.h> 
#define SZ 65536 


void subr (double A[], double D[], int num, int64_t *time, int mapnum) { 
OBM_BANK_A (AL, double, SZ)
OBM_BANK_D (DL, double, SZ)
int64_t t0, t1;
float slope, intercept;
int i;











DMA_CPU (CM2OBM, AL, MAP_OBM_stripe(1,"A"), A, 1, SZ*sizeof (double), 0);
wait_DMA (0); 


read_timer (&t0); 


for (i=0; i<num; i++) { 

if (AL[i] < 0.121224) { 
slope = 3.07373; 
intercept = 0.00105; 
} 

else if (AL[i] < 0.200940) { 
slope = 2.74354; 
intercept = 0.04085; 
} 

else if (AL[i] < 0.269054) { 
slope = 2.32099; 
intercept = 0.12564; 
} 

else if (AL[i] < 0.331366) {
slope = 1.84314;
intercept = 0.25413; 
} 

else if (AL[i] < 0.390378) { 
slope = 1.32869; 
intercept = 0.42456; 
} 

else if (AL[i] < 0.447590) { 
slope = 0.79036; 
intercept = 0.63469; 
} 

else { 

slope = 0.25817; 
intercept = 0.87262; 
} 

DL[i] = slope * AL[i] + intercept;
}


read_timer (&t1);


*time = t1 - t0;


DMA_CPU (OBM2CM, DL, MAP_OBM_stripe(1,"D"), D, 1, SZ*sizeof (double), 0); 
wait_DMA (0); 
} 
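

As a host-side sanity check (not part of the thesis code), the segment table above can be evaluated in plain C and compared against a few exactly known values of sin(πx). The names seg_end, seg_slope, seg_icpt, sin_pi_approx, and abs_err are hypothetical; the coefficients are copied from the listing, and the reference values are the standard ones for x = 0, 1/6, 1/4, 1/3, and 1/2.

```c
#include <stddef.h>

#define NSEG 7

/* Upper endpoints, slopes, and intercepts copied from the Sin.mc listing. */
static const double seg_end[NSEG - 1] =
    {0.121224, 0.200940, 0.269054, 0.331366, 0.390378, 0.447590};
static const double seg_slope[NSEG] =
    {3.07373, 2.74354, 2.32099, 1.84314, 1.32869, 0.79036, 0.25817};
static const double seg_icpt[NSEG] =
    {0.00105, 0.04085, 0.12564, 0.25413, 0.42456, 0.63469, 0.87262};

/* Piecewise-linear approximation of sin(pi*x) for 0 <= x <= 0.5. */
static double sin_pi_approx(double x) {
    size_t k = 0;
    while (k < NSEG - 1 && x >= seg_end[k]) {
        k++;  /* move to the segment whose upper endpoint exceeds x */
    }
    return seg_slope[k] * x + seg_icpt[k];
}

/* Absolute difference between the approximation and a reference value. */
static double abs_err(double x, double reference) {
    double d = sin_pi_approx(x) - reference;
    return d < 0.0 ? -d : d;
}
```

For instance, abs_err(0.25, 0.70710678) compares the approximation at x = 1/4 with sin(π/4); at the five reference points listed above the difference stays below roughly 0.002.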


2. Fixed Point 


a. Main.c 


#include <libmap.h> 
#include <stdlib.h> 
#include <math.h>

#define SZ 65536

void subr (int64_t*, int64_t*, int, int64_t*, int);
int main (int argc, char *argv[]) {
    int i, num;
    int64_t *A, *D;
    int64_t tm;
    int mapnum = 0;

    if (argc < 2) {
        fprintf (stderr, "need number of elements (up to %d) as arg\n", SZ);
        exit (1);
    }

    if (sscanf (argv[1], "%d", &num) < 1) {
        fprintf (stderr, "need number of elements (up to %d) as arg\n", SZ);
        exit (1);
    }

    if (num > SZ) {
        fprintf (stderr, "need number of elements (up to %d) as arg\n", SZ);
        exit (1);
    }

    A = (int64_t*) malloc (SZ * sizeof (int64_t));
    D = (int64_t*) malloc (SZ * sizeof (int64_t));

    srandom (99);

    for (i=0; i<num; i++) {
        A[i] = rand() % 256;
        if (A[i] < 0) A[i] = -A[i];
    }

    map_allocate (1);

    // call the MAP routine
    subr (A, D, num, &tm, mapnum);

    // print results
    printf ("%lld clocks\n", tm);
    for (i=0; i<num; i++) {
        float input = A[i] * pow(2,-9);
        float output = D[i] * pow(2,-14);
        printf ("Iteration %d. Sine of (%f * pi) equals %f ITE INT\n", i, input, output);
    }
    map_free (1);
    exit (0);
}


b. Sin.mc


#include <libmap.h>
#define SZ 65536

void subr (int64_t A[], int64_t D[], int num, int64_t *time, int mapnum) {
    OBM_BANK_A (AL, int64_t, SZ)
    OBM_BANK_D (DL, int64_t, SZ)
    int64_t t0, t1;
    int64_t slope, intercept;
    int i;

    DMA_CPU (CM2OBM, AL, MAP_OBM_stripe(1,"A"), A, 1, SZ*sizeof (int64_t), 0);
    wait_DMA (0);

    read_timer (&t0);

    for (i=0; i<num; i++) {
        if (AL[i] < 0x1F) {
            slope = 0x312;
            intercept = 0x00;
        }
        else if (AL[i] < 0x33) {
            slope = 0x2BE;
            intercept = 0x0A;
        }
        else if (AL[i] < 0x44) {
            slope = 0x252;
            intercept = 0x20;
        }
        else if (AL[i] < 0x54) {
            slope = 0x1D7;
            intercept = 0x41;
        }
        else if (AL[i] < 0x63) {
            slope = 0x154;
            intercept = 0x6C;
        }
        else if (AL[i] < 0x72) {
            slope = 0x0CA;
            intercept = 0xA2;
        }
        else {
            slope = 0x042;
            intercept = 0xDF;
        }

        DL[i] = slope * AL[i] + intercept;
    }

    read_timer (&t1);
    *time = t1 - t0;
    DMA_CPU (OBM2CM, DL, MAP_OBM_stripe(1,"D"), D, 1, SZ*sizeof(int64_t), 0);

    wait_DMA (0);
}


C. SRC VHDL MACRO IMPLEMENTATION CODE


1. SRC C Coding 


a. Main.c 


#include <libmap.h>
#include <stdlib.h>
#include <math.h>

#define SZ 65536

void subr (int64_t*, int64_t*, int, int64_t*, int);
int main (int argc, char *argv[]) {
    int i, num;
    int64_t *A, *D;
    int64_t tm;
    int mapnum = 0;
    if (argc < 2) {
        fprintf (stderr, "need number of elements (up to %d) as arg\n", SZ);
        exit (1);
    }
    if (sscanf (argv[1], "%d", &num) < 1) {
        fprintf (stderr, "need number of elements (up to %d) as arg\n", SZ);
        exit (1);
    }
    if (num > SZ) {
        fprintf (stderr, "need number of elements (up to %d) as arg\n", SZ);
        exit (1);
    }

    A = (int64_t*) malloc (SZ * sizeof (int64_t));
    D = (int64_t*) malloc (SZ * sizeof (int64_t));

    srandom (99);

    for (i=0; i<num; i++) {
        A[i] = rand() % 512; // 9-bits
        if (A[i] < 0) A[i] = -A[i]; // Keeping it positive
    }

    map_allocate (1);

    // call the MAP routine
    subr (A, D, num, &tm, mapnum);

    // print results
    printf ("%lld clocks\n", tm);

    for (i=0; i<num; i++) {
        printf ("Iteration %d. Sine of (%llx * pi) equals %llx VHDL Signed\n", i, A[i], D[i]);
    }

    map_free (1);

    exit (0);
}


b. Sine.mc 


#include <libmap.h>
#define SZ 65536


void subr (int64_t A[], int64_t D[], int num, int64_t *time, int mapnum) {
    OBM_BANK_A (AL, int64_t, SZ)
    OBM_BANK_D (DL, int64_t, SZ)
    int64_t t0, t1;
    int i, x, fx;
    void sinfct (int x, int *fx);

    DMA_CPU (CM2OBM, AL, MAP_OBM_stripe(1,"A"), A, 1, SZ*sizeof(int64_t), 0);
    wait_DMA (0);

    read_timer (&t0);

    // one call to the VHDL macro per input value
    for (i=0; i<num; i++) {
        x = AL[i];
        sinfct (x, &fx);
        DL[i] = fx;
    }

    read_timer (&t1);
    *time = t1 - t0;
    DMA_CPU (OBM2CM, DL, MAP_OBM_stripe(1,"D"), D, 1, SZ*sizeof(int64_t), 0);

    wait_DMA (0);
}


c. Makefile


# $Id: Makefile,v 2.0.0.1 2005/06/10 23:12:59 hammes Exp $
#
# Copyright 2003 SRC Computers, Inc. All Rights Reserved.
#
# Manufactured in the United States of America.
#
# SRC Computers, Inc.
# 4240 N Nevada Avenue
# Colorado Springs, CO 80907
# (v) (719) 262-0213
# (f) (719) 262-0223
#
# No permission has been granted to distribute this software
# without the express permission of SRC Computers, Inc.
#
# This program is distributed WITHOUT ANY WARRANTY OF ANY KIND.
#

#
# User defines FILES, MAPFILES, and BIN here
#
FILES = main.c
MAPFILES = sine.mc
BIN = sinebin

#
# Multi chip info provided here
# (Leave commented out if not used)
#
#PRIMARY = <primary file 1> <primary file 2>
#SECONDARY = <secondary file 1> <secondary file 2>
#CHIP2 = <file to compile to user chip 2>

#
# User defined directory of code routines
# that are to be inlined
#
#INLINEDIR =

#
# User defined macros info supplied here
# (Leave commented out if not used)
#
# SRC is case-sensitive, but vhdl code created in Xilinx from schematics will ignore and not use upper-case
# Recommend not using upper-case in macro files / module names / etc
#
MACROS = macro/sin.vhd macro/adder16x16s.vhd macro/mult16x16s.vhd macro/slopeintlu.vhd
MY_BLKBOX = macro/blk.v
MY_NGO_DIR = macro
MY_INFO = macro/info

#
# Floating point macros selection
#
#FPMODE = SRC_IEEE_V1 # Default SRC version IEEE
#FPMODE = SRC_IEEE_V2 # Size reduced SRC IEEE with
#                     # special rounding mode

#
# User supplied MCC and MFTN flags
#
MY_MCCFLAGS = -log -v
MY_MFTNFLAGS = -log -v

#
# User supplied flags for C & Fortran compilers
#
CC = icc # icc for Intel cc for Gnu
FC = ifort # ifort for Intel f77 for Gnu
#LD = ifort # for Fortran or C/Fortran mixed
LD = icc # for C codes
MY_CFLAGS =
MY_FFLAGS =
MY_LDFLAGS = # Flags to include libs if needed

#
# VCS simulation settings
# (Set as needed, otherwise just leave commented out)
#
#USEVCS = yes # YES or yes to use vcs instead of vcsi
#VCSDUMP = yes # YES or yes to generate vcd+ trace dump

#
# No modifications are required below
#
MAKIN ?= $(MC_ROOT)/opt/srcci/comp/lib/AppRules.make
include $(MAKIN)


2. SRC Macro Files 












































a. Info 

BEGIN_DEF "sinfct"
MACRO = "Sin";
STATEFUL = NO;
EXTERNAL = NO;
PIPELINED = YES;
LATENCY = 2;

INPUTS = 1

I0 = INT 32 BITS (X[14:0])

O0 = INT 32 BITS (FX[14:0])

IN_SIGNAL : 1 BITS "CLK" = "CLOCK";

DEBUG_HEADER = #
void sinfct__dbg (int x, int *fx);
#;

DEBUG_FUNC = #
void sinfct__dbg (int x, int *fx) {
}
#;
END_DEF














b. Blk.v


module sin (CLK, X, FX)/* synthesis syn_black_box */; 
input CLK; 
input [14:0] X; 
output [14:0] FX; 

endmodule


APPENDIX D. PC C++ CODE


//****************************************************************
// File: FctTime.cpp
// Name: Tom Mack
// Course: Thesis
// OS: WinXP Pro
// Compiler: Visual Studio 2005
// Date: 7 December 2006
// Description: Function Time
//
// Inputs: NONE
// Output:
// Process:
//
// Warnings: None.
//****************************************************************

#include <iostream> //Header for I/O
using std::cout;
using std::endl;

#include <ctime>
using std::time;

#include <cstdlib>
using std::rand;
using std::srand;

#include <cmath>
using std::sin;
using std::tan;
using std::cos;

#include <iomanip>
using std::setprecision;
using std::fixed;
using std::scientific;


int main()
{
    double x, fx, timePerCalc;
    double duration, duration1, duration2;
    clock_t start, finish1, finish2;
    int time1, time2;

    double base = 10;
    double exp = 8;

    const int ITERATIONS = pow(base, exp);

    x = static_cast<double>(( rand() % 5000 )) / 10000;

    start = clock(); //Set Start Time
    for (int ix = 1; ix <= ITERATIONS; ix++)
    {
        //Do Nothing
    }
    finish1 = clock(); //Get End Time

    for (int ix = 1; ix <= ITERATIONS; ix++)
    {
        fx = sin(3.14159 * x);
    }
    finish2 = clock(); //Get End Time

    //Subtract the empty-loop time from the sin() loop time
    duration1 = static_cast<double>(finish1 - start) / CLOCKS_PER_SEC;
    duration2 = static_cast<double>(finish2 - finish1) / CLOCKS_PER_SEC;
    duration = duration2 - duration1;

    timePerCalc = duration / ITERATIONS;
    cout << "\n*** sin(x) ***\n";
    cout << "\n\nx: " << x << " fx: " << fx << "\n\n";
    cout << "Number of iterations: " << ITERATIONS << " = " << base << "e" << exp << "\n\n";
    /*cout << "Start Time: " << start << "\n";
    cout << "Finish1 Time: " << static_cast<double>(finish1) / CLOCKS_PER_SEC << " seconds\n";
    cout << "Finish2 Time: " << static_cast<double>(finish2) / CLOCKS_PER_SEC << " seconds\n";
    cout << "Difference: " << duration << " seconds\n\n"; */
    cout << "Time per calculation " << timePerCalc << "= " << timePerCalc / 0.000000001
         << " nanoseconds " << '\n' << endl;

    return 0;
}//End main ()







APPENDIX E. LESSONS LEARNED 


The following is a collection of Lessons Learned while working with the SRC-6
and related software discussed in this thesis. The intent is to provide future users with a
reference where they may find potential solutions if they encounter similar issues.


A. FILE NAMING PROBLEMS 


Problem: When you compile your VHDL code using Xilinx’s ISE Navigator, it treats
upper- and lower-case versions of a letter as the same. That is, adderVerilog.vhd and
adderverilog.vhd are the same file to Xilinx’s ISE Navigator. However, file names on the
SRC-6 are case sensitive. That is, adderVerilog.vhd and adderverilog.vhd are
DIFFERENT files on the SRC-6. So, if you have listed adderverilog.vhd in your
Makefile as a macro, it will not recognize adderVerilog.vhd as the target file.
Additionally, if you let Xilinx create VHDL code from a schematic that contains the
module adderVerilog.vhd, it will refer to the module in the VHDL code as
adderverilog.vhd.


Solution: Use lower case letters for ALL files. 


Author: J.T. Butler 
Date: 26 FEB 07 


B. USING THE CONST CONSTRUCT IN C 


Problem: A martello64 error is obtained when using


int64_t array[5][5] = { {1,2,3,4,5},
{6,7,8,9,10},
{11,12,13,14,15},
{16,17,18,19,20},
{21,22,23,24,25} };


The error is caused by “too many accesses to BRAM”. 


Background: This is a correct C construct when used on a PC or workstation. 
However, when it is in a .mc file, this declaration will cause a martello64 error. It is 
possibly due to too many accesses to a BRAM (arrays are usually stored in BRAM). 


This was a problem that Scott Bailey experienced. The initial writeup is based on a
conversation between Scott Bailey and Jon Butler on December 1, 2006.


Solution: In discussing this with Dave Caliga, Scott learned that the CARTE 2.2 version
should correct this error. At the time the error occurred, we were using CARTE 2.1.
Apparently, CARTE 2.2 spaces out the accesses to BRAM so that the array can
include ALL 25 data values. However, in order to use it in CARTE 2.2, you need to
declare the array as a constant, like so


const int64_t array[5][5] = { {1,2,3,4,5},
{6,7,8,9,10},
{11,12,13,14,15},
{16,17,18,19,20},
{21,22,23,24,25} };


The intent of const is to set up a constant array that is not changed in the rest of the 
program, much like a ROM instead of RAM. 


Scott Bailey tried to work around this error by simply defining the array without
populating it with initial values, using, for example: int64_t array[5][5]; The
compiler accepted this. He then put the desired values into the array using for loops.
These arrays will then work as normal C arrays within the .mc code. However, this
decreases performance, since the values placed into the array must come from either
OBM or streams, access of which will incur a time penalty. Scott believes that the
problem is in putting too many values into BRAM too quickly. In a dialog with Dave
Caliga (SRC Computers), Dave said that the problem occurs when there are more than 8
initialized values placed in the array. Scott believes that this problem will occur in
BOTH CARTE 2.1 and 2.2 for non-constant BRAM arrays.
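

The workaround described above can be sketched in plain C as follows. This is only an illustration: the function name fill_array is hypothetical, and the values here are computed from the loop indices, whereas in a .mc file they would come from OBM or a stream as noted in the text.

```c
#include <stdint.h>

#define ROWS 5
#define COLS 5

/* Populate the array element by element instead of using an initializer list. */
static void fill_array(int64_t array[ROWS][COLS]) {
    for (int r = 0; r < ROWS; r++) {
        for (int c = 0; c < COLS; c++) {
            /* Row-major values 1..25, matching the initializer shown earlier. */
            array[r][c] = (int64_t)(r * COLS + c + 1);
        }
    }
}
```

A caller would declare int64_t array[5][5]; with no initializer and then call fill_array(array) before the first read.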


Author: J.T. Butler 
Date: 26 FEB 07 


C. INCORRECT ARGUMENTS IN SYSTEM SUPPLIED MACROS


Problem: A core dump occurs when the call-by-value and call-by-reference conventions
are not adhered to, for example


popcount_64(int64_t a, int array[i])


Instead of an error message, there will be a core dump.


Background: This was provided by Scott Bailey in a conversation with Jon Butler on 
December 1, 2006. 


Solution: To solve this problem, use the following code. 


popcount_64(int64_t a, &temp)
array[i] = temp;


For most system macros, SRC requires that input values be passed call-by-value
(e.g. a) and that all output values be passed call-by-reference (e.g. &temp).
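

The convention can be illustrated in plain C with a stand-in function. The name demo_popcount_64 and its body are hypothetical and are not the SRC system macro; the sketch only shows the calling pattern of passing the input by value, receiving the output through a pointer to a scalar temporary, and then copying the temporary into the array.

```c
#include <stdint.h>

/* Hypothetical stand-in for a system macro: input by value, output by reference. */
static void demo_popcount_64(int64_t a, int *count) {
    uint64_t bits = (uint64_t)a;
    int n = 0;
    while (bits != 0) {
        n += (int)(bits & 1u);  /* count the lowest bit */
        bits >>= 1;
    }
    *count = n;
}

/* Store each result through a scalar temporary before writing the array. */
static void count_all(const int64_t *in, int *array, int len) {
    for (int i = 0; i < len; i++) {
        int temp;
        demo_popcount_64(in[i], &temp);
        array[i] = temp;
    }
}
```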


Author: J.T. Butler 
Date: 26 FEB 07 


D. IF / THEN / ELSE LIMITATION 
Problem: When programming in C within the .mc file (no macro) an error occurs when 
the “If, then, else” chain is too long (approx 26 long). 


Background: This was discovered by Prof. Jon Butler when trying to implement a long 
“if,then,else” string during testing. 


Solution: SRC Carte V2.2 fixes this problem. 


Author: T.J. Mack 
Date: 26 FEB 07 


E. MULTIPLE FILES USED IN A MACRO 
Problem: When using multiple files to describe a circuit in a macro, the SRC won’t 
successfully compile. 


Background: This was discovered while developing the NFG macro where different 
modules are described in separate VHDL files. 


Solution: List all of the VHDL files within the Makefile under macros, separated by a 
space. 


Author: T.J. Mack 
Date: 26 FEB 07 


F. XILINX / SYNPLIFY INCONSISTENCIES 

Problem: VHDL code synthesizes correctly (no errors) in Xilinx XST, but does not in 
Synplify PRO. 

Background: When developing VHDL code for the NFG, the code was originally 
written in the Xilinx ISE. Checking for errors using Xilinx XST resulted in no errors. 


When the code was transported to the SRC, errors resulted. Further troubleshooting
produced the same errors when using the stand-alone Synplify.


Solution: Not all code is universal. Always test code using a stand-alone version of
Synplify. If it results in errors, the code must be modified.


Author: T.J. Mack 
Date: 26 FEB 07 


G. MODELSIM AND MULTIPLE HDL’S 


Problem: ModelSim XE (Xilinx Edition) which is obtained for free from the Xilinx 
website does not support multiple HDL’s. 


Background: When developing the NFG, some code was provided by SRC in Verilog. 
When attempting to analyze the circuit with a test bench, an error occurred in ModelSim. 
The error stated that ModelSim XE does not support multiple HDL’s. 


Solution: Download ModelSim SE. NPS has a license. Details available from Dan 
Zulaica. 


Author: T.J. Mack 
Date: 26 FEB 07 


H. INITIALIZING MEMORY FROM A SEPARATE FILE 


Problem: Xilinx allows one to synthesize a ROM where the ROM contents are specified 
in a separate file. When transferring the VHDL files to the SRC and synthesizing with 
Synplify, an error results. This is another artifact of problem F. above. 


Background: Because of the potentially large amount of data needed to load into a 
ROM, it is useful to have a separate file with just this data. The HDL must then access 
this data file during synthesis. 


Solution: The problem is not yet completely solved. Some potential solutions are: 


1. Below is a ROM provided by SRC Computers. Written in Verilog (SRC 
Computers' preferred language), it consists of 32 four-input, 1-bit-output LUTs and has 
a 32-bit output. It is initialized using a separate .sdc file. 


module MY_ROM ( 
    data, 
    adr 
); 
    output [31:0] data; 
    input  [3:0]  adr; 

    ROM16X1 M0 ( 
        .O  (data[0]), 
        .A0 (adr[0]), 
        .A1 (adr[1]), 
        .A2 (adr[2]), 
        .A3 (adr[3]) 
    ); 

    ROM16X1 M1 ( 
        .O  (data[1]), 
        .A0 (adr[0]), 
        .A1 (adr[1]), 
        .A2 (adr[2]), 
        .A3 (adr[3]) 
    ); 

    // *** Fill-In Remaining Modules *** 

    ROM16X1 M31 ( 
        .O  (data[31]), 
        .A0 (adr[0]), 
        .A1 (adr[1]), 
        .A2 (adr[2]), 
        .A3 (adr[3]) 
    ); 

endmodule 


The ROM initialization values are in the .sdc file below. The INIT values are 
somewhat cumbersome because each LUT is only 1 bit wide: each LUT holds a single bit 
position of all 16 stored values. The INIT values essentially form a 32-row by 16-column 
matrix, in which each row is the 16-bit INIT string of one LUT and each column is one 
of the 16 32-bit ROM words. 

















define_attribute {i:M0} xc_props "INIT=ba5d" 
define_attribute {i:M1} xc_props "INIT=8801" 
# *** Fill-In Missing Values *** 
define_attribute {i:M31} xc_props "INIT=1321" 


This is the most promising example of a ROM with an external file for 
initialization. However, the 1-bit format of the init values makes it difficult to 
implement. 
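The bit-slicing described above can be automated. The following Python sketch is not part of the original work; it assumes that bit a of a LUT's INIT value is the output for address a (so the rightmost hex digit covers addresses 0 to 3) and that LUT Mk drives data[k]. The function names and the example ROM contents are hypothetical.

```python
def rom16x1_inits(words, width=32):
    """Return one 4-hex-digit INIT string per output bit.

    words: the 16 ROM words, indexed by address.
    Assumed convention: bit `addr` of INIT string `k` is bit `k` of words[addr].
    """
    if len(words) != 16:
        raise ValueError("a ROM16X1 holds exactly 16 entries")
    inits = []
    for k in range(width):
        value = 0
        for addr, word in enumerate(words):
            # Take bit k of this word and place it at bit position `addr`.
            value |= ((word >> k) & 1) << addr
        inits.append(f"{value:04x}")
    return inits


def sdc_lines(words, width=32):
    """Format the INIT strings as .sdc define_attribute lines."""
    return [
        f'define_attribute {{i:M{k}}} xc_props "INIT={init}"'
        for k, init in enumerate(rom16x1_inits(words, width))
    ]


# Hypothetical ROM contents: every byte of word `addr` equals `addr`.
example_words = [addr * 0x01010101 for addr in range(16)]
for line in sdc_lines(example_words)[:2]:
    print(line)
```

With a script of this kind the 32 INIT strings would be regenerated whenever the coefficients change, rather than being transposed by hand.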


2. Below is another ROM example provided by SRC Computers. It uses 
the RAMB16_S18_S18 module, which is a 16 Kb Block RAM with two 18-bit ports 
(16 data bits plus 2 parity bits each). It is initialized using the xc_props lines within the 
same file. 


module MY_ROM ( 
    din_0, 
    dout_0, 
    din_1, 
    dout_1, 
    adr_0, 
    adr_1, 
    w_en_0, 
    w_en_1, 
    clk 
); 
    input  [15:0] din_0; 
    output [15:0] dout_0; 
    input  [15:0] din_1; 
    output [15:0] dout_1; 
    input  [9:0]  adr_0; 
    input  [9:0]  adr_1; 
    input         w_en_0; 
    input         w_en_1; 
    input         clk /* synthesis syn_noclockbuf=1 */ ; 

    RAMB16_S18_S18 M0 ( 
        .DOA   (dout_0[15:0]), 
        .DOB   (dout_1[15:0]), 
        .DOPA  (),             // ignore the parity outputs 
        .DOPB  (),             // ignore the parity outputs 
        .ADDRA (adr_0), 
        .ADDRB (adr_1), 
        .CLKA  (clk), 
        .CLKB  (clk), 
        .DIA   (din_0[15:0]), 
        .DIB   (din_1[15:0]), 
        .DIPA  (2'b0),         // zero the parity inputs 
        .DIPB  (2'b0),         // zero the parity inputs 
        .ENA   (1'b1), 
        .ENB   (1'b1), 
        .SSRA  (1'b0), 
        .SSRB  (1'b0), 
        .WEA   (w_en_0), 
        .WEB   (w_en_1) 
    ) /* synthesis 
xc_props="INIT_00=76931fac9dab2b36c248b87d6ae33f9a62d7183a5d5789e4b2d6b441e2411dc7, \ 
INIT_01=09el1llc7ele7achb6f8cac0bb2fc4c8bc2ae3baaab9165cc458e199ch89F51b13, \ 
INIT_02=5£7091a5abb0874df£3e8cb4543a5eb93b0441e9ca4c2b0 fb3d30875cb£29abd5, \ 
INIT_3e=la0bf 9b00ffd21b6210b11dc59ec947be8 6d1llel0de2e980b8bc98 8e26aba269, \ 
*** Fill-In Missing Values *** 
INIT_3f=ac6bd4cd2bf0471f£cb95377922449de5393850a00a57b47800d374d961ldfeb5" */ ; 

endmodule 
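The INIT_xx strings for a block RAM of this kind can be generated rather than typed. The Python sketch below is not part of the original work. It assumes that each INIT_xx string holds 16 consecutive 16-bit words as 64 hex digits, with the lowest address in the rightmost four digits, and that the 1024-word memory needs 64 such strings (INIT_00 through INIT_3f). The function names are hypothetical.

```python
def ramb16_init_strings(words):
    """Pack up to 1024 16-bit words into 64 INIT_xx strings.

    Assumed layout: word 16*n + i occupies bits [16*i + 15 : 16*i] of INIT_n,
    so word 0 is the rightmost four hex digits of INIT_00.
    Unused locations are filled with zeros.
    """
    if len(words) > 1024:
        raise ValueError("a 1024 x 16 block RAM holds at most 1024 words")
    padded = list(words) + [0] * (1024 - len(words))
    inits = {}
    for row in range(64):
        value = 0
        for i, word in enumerate(padded[16 * row:16 * row + 16]):
            if not 0 <= word <= 0xFFFF:
                raise ValueError("each word must fit in 16 bits")
            value |= word << (16 * i)
        inits[f"INIT_{row:02x}"] = f"{value:064x}"
    return inits


def xc_props_body(words):
    """Join the INIT strings in the comma-and-backslash style used by xc_props."""
    inits = ramb16_init_strings(words)
    return ", \\\n".join(f"{name}={value}" for name, value in inits.items())


# Hypothetical contents: two words, the rest of the memory left at zero.
print(xc_props_body([0x1dc7, 0xe241]).splitlines()[0])
```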


3. The following code is a 16 x 32-bit ROM written in Verilog. It will 
synthesize in Xilinx XST, but not in Synplify PRO. 


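module romverlog(input [3:0] raddr, output [31:0] slope_int); 

    // 16 words of 32 bits, matching the 4-bit address and the 32-bit output 
    reg [31:0] mem [15:0]; 

    // Load the ROM contents from an external binary text file 
    initial 
    begin 
        $readmemb("memory.mem", mem); 
    end 

    assign slope_int = mem[raddr]; 
endmodule 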


The associated memory.mem file is a simple, binary text file with the memory 
initialization values. 
00000110010001000000000000000000 
00000110001011010000000000000000 
00000101111111110000000000000100 
00000101101110100000000000001100 
00000101011000000000000000011010 
00000100111100010000000000101111 
00000100011100000000000001001101 
00000011110111110000000001110100 
00000011001111110000000010100101 
00000010100100110000000011100001 
00000001110111100000000100100111 
00000001001000010000000101110111 
00000000011000000000000111001111 
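A file in this format is easy to produce with a short script. The Python sketch below is not part of the original work. It assumes that each 32-bit line is a 16-bit slope in the upper half followed by a 16-bit intercept in the lower half; that split is inferred from the slope_int port name and is not stated in the text. The function names are hypothetical.

```python
def mem_line(slope, intercept, field_bits=16):
    """Return one line of the memory file as a string of '0' and '1'.

    Assumed layout: slope in the upper field, intercept in the lower field.
    Negative values are stored in two's complement by masking.
    """
    mask = (1 << field_bits) - 1
    word = ((slope & mask) << field_bits) | (intercept & mask)
    return format(word, f"0{2 * field_bits}b")


def write_mem_file(path, pairs):
    """Write one line per (slope, intercept) pair and return the lines."""
    lines = [mem_line(slope, intercept) for slope, intercept in pairs]
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return lines


# Under the assumed layout, these two pairs give the first and third lines above.
print(mem_line(0x0644, 0x0000))
print(mem_line(0x05FF, 0x0004))
```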




















Author: T.J. Mack 
Date: 26 FEB 07 


I. MACRO LATENCY AND SRC OVERHEAD 


Problem: When implementing a macro, the SRC requires additional clock cycles for 
overhead operations. The overhead appears to be 5 clock cycles to pass data to a macro 
and an additional 5 clock cycles to receive data from a macro. One would expect a macro 
with a latency of 3 to take a total of 13 clock cycles. However, it takes only 12: the last 
clock cycle of the macro is absorbed into the 5 clock cycles needed to receive data from 
the macro. In this case, the latency in the info file must be set equal to 2, even though the 
schematic may show a latency of 3. 


Background: When developing the NFG, pipeline depth reports for the loop that calls 
the NFG macro were always 10 clock cycles more than the latency given in the info file. 


Solution: No solution. This is a characteristic of the SRC architecture. 
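The arithmetic above can be written down as a small helper. The Python sketch below is not part of the original work. The two overhead constants are the observed values quoted above, and extending the "subtract one" rule to latencies other than 3 is an assumption. The function names are hypothetical.

```python
SEND_OVERHEAD = 5     # clock cycles observed to pass data to a macro
RECEIVE_OVERHEAD = 5  # clock cycles observed to receive data from a macro


def info_file_latency(schematic_latency):
    """Latency to enter in the info file.

    Assumption: one macro clock cycle is always absorbed into the receive
    overhead, as observed for a schematic latency of 3.
    """
    return max(schematic_latency - 1, 0)


def total_clocks(schematic_latency):
    """Total clock cycles reported for a loop that calls the macro."""
    return SEND_OVERHEAD + info_file_latency(schematic_latency) + RECEIVE_OVERHEAD


print(info_file_latency(3), total_clocks(3))
```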


Author: T.J. Mack 
Date: 26 FEB 07 
APPENDIX F. SRC OUTPUT 


The following are Place and Route Summary Reports from the SRC for various 
functions. For each, the function name, input domain, and number of segments are shown, 
along with sample inputs and outputs and the total number of clock cycles. 


sine(pi*x) from 0 < x <= 0.5 
7 segments 


####################  PLACE AND ROUTE SUMMARY  #################### 
Number of Slice Flip Flops:    4,058 out of 67,584    6% 
Number of 4 input LUTs:        1,530 out of 67,584    2% 
Number of occupied Slices:     2,425 out of 33,792    7% 
Number of MULT18X18s:              1 out of    144    1% 
freq = 90.7 MHz 
################################################################### 


42 clocks 

Iteration 0. Sine of (14 * pi) equals 3d VHDL Signed 
Iteration 1. Sine of (54 * pi) equals db VHDL Signed 
Iteration 2. Sine of (15 * pi) equals 40 VHDL Signed 
Iteration 3. Sine of (35 * pi) equals 9a VHDL Signed 
Iteration 4. Sine of (31 * pi) equals 90 VHDL Signed 
Iteration 5. Sine of (50 * pi) equals d4 VHDL Signed 
Iteration 6. Sine of (1 * pi) equals 3 VHDL Signed 
Iteration 7. Sine of (5b * pi) equals e5 VHDL Signed 
Iteration 8. Sine of (7e * pi) equals ff VHDL Signed 
Iteration 9. Sine of (19 * pi) equals 4c VHDL Signed 
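The sample values above can be compared with a floating-point reference. The Python sketch below is not part of the original work. It assumes that the printed input and output are hexadecimal fixed-point numbers with 8 fractional bits, which is inferred from the data and is not stated in the report. The names are hypothetical.

```python
import math

FRACTION_BITS = 8  # assumption: inputs and outputs carry 8 fractional bits

# (input, output) pairs copied from the ten iterations above, in hexadecimal.
SAMPLES = [
    (0x14, 0x3d), (0x54, 0xdb), (0x15, 0x40), (0x35, 0x9a), (0x31, 0x90),
    (0x50, 0xd4), (0x01, 0x03), (0x5b, 0xe5), (0x7e, 0xff), (0x19, 0x4c),
]


def reference(x_raw):
    """Floating-point sin(pi * x), scaled to the same fixed-point units."""
    x = x_raw / 2 ** FRACTION_BITS
    return math.sin(math.pi * x) * 2 ** FRACTION_BITS


# Deviation of each reported output from the reference, in least-significant bits.
errors = [abs(reference(x_raw) - y_raw) for x_raw, y_raw in SAMPLES]
worst = max(errors)
print(f"worst-case deviation: {worst:.2f} LSB")
```

Under that assumed format, every sample lies within two least-significant bits of the reference.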


sine(pi*x) from 0 <= x <= 2 
28 segments 


####################  PLACE AND ROUTE SUMMARY  #################### 
Number of Slice Flip Flops:    4,077 out of 67,584    6% 
Number of 4 input LUTs:        1,699 out of 67,584    2% 
Number of occupied Slices:     2,513 out of 33,792    7% 
Number of MULT18X18s:              1 out of    144    1% 
freq = 92.1 MHz 
################################################################### 


42 clocks 

Iteration 0. Sine of (194 * pi) equals 7f06 VHDL Signed 
Iteration 1. Sine of (1d4 * pi) equals 7f7b VHDL Signed 
Iteration 2. Sine of (15 * pi) equals 40 VHDL Signed 
Iteration 3. Sine of (1b5 * pi) equals 7f33 VHDL Signed 
Iteration 4. Sine of (1b1 * pi) equals 7f2b VHDL Signed 
Iteration 5. Sine of (d0 * pi) equals 8d VHDL Signed 
Iteration 6. Sine of (1 * pi) equals 3 VHDL Signed 
Iteration 7. Sine of (1db * pi) equals 7f8f VHDL Signed 
Iteration 8. Sine of (fe * pi) equals 5 VHDL Signed 
Iteration 9. Sine of (99 * pi) equals f8 VHDL Signed 


sqrt(-ln(x)) from 1/256 <= x <= 1/4 
144 segments 


####################  PLACE AND ROUTE SUMMARY  #################### 
Number of Slice Flip Flops:    4,067 out of 67,584    6% 
Number of 4 input LUTs:        1,823 out of 67,584    2% 
Number of occupied Slices:     2,567 out of 33,792    7% 
Number of MULT18X18s:              1 out of    144    1% 
freq = 88.5 MHz 
################################################################### 

Iteration 0. Sqrt(- 
Iteration 1. Sqrt(- 
Iteration 2. Sqrt(- 
Iteration 3. Sqrt(- 
Iteration 4. Sqrt(- 
Iteration 5. Sqrt(- 
Iteration 6. Sqrt(- 
Iteration 7. Sqrt(- 
Iteration 8. Sqrt(- 
Iteration 9. Sqrt(- 


sqrt(x), 0 <= x < 2, error = 0.01 
488 segments 
####################  PLACE AND ROUTE SUMMARY  #################### 
Number of Slice Flip Flops:    4,060 out of 67,584    6% 
Number of 4 input LUTs:        2,594 out of 67,584    3% 
Number of occupied Slices:     2,974 out of 33,792    8% 
Number of MULT18X18s:              1 out of    144    1% 
freq = 81.3 MHz 
################################################################### 


42 clocks 

Iteration 0. Sqrt(x) 194 equals 140 VHDL Signed 
Iteration 1. Sqrt(x) 1d4 equals 159 VHDL Signed 
Iteration 2. Sqrt(x) 15 equals 49 VHDL Signed 
Iteration 3. Sqrt(x) 
Iteration 4. Sqrt(x) 1b1 equals 14c VHDL Signed 
Iteration 5. Sqrt(x) d0 equals e5 VHDL Signed 
Iteration 6. Sqrt(x) 1 equals f VHDL Signed 
Iteration 7. Sqrt(x) 1db equals 15b VHDL Signed 
Iteration 8. Sqrt(x) fe equals fe VHDL Signed 
Iteration 9. Sqrt(x) 99 equals c5 VHDL Signed 


sqrt(x), 0 <x <2, error=2x 2° 
3330 segments 
####################  PLACE AND ROUTE SUMMARY  #################### 
Number of Slice Flip Flops:    4,060 out of 67,584    6% 
Number of 4 input LUTs:        2,581 out of 67,584    3% 
Number of occupied Slices:     2,982 out of 33,792    8% 
Number of MULT18X18s:              1 out of    144    1% 
freq = 80.9 MHz 
################################################################### 


42 clocks 

Iteration 0. Sine of (194 * pi) equals 140 VHDL Signed 
Iteration 1. Sine of (1d4 * pi) equals 159 VHDL Signed 
Iteration 2. Sine of (15 * pi) equals 49 VHDL Signed 
Iteration 3. Sine of (1b5 * pi) equals 14d VHDL Signed 
Iteration 4. Sine of (1b1 * pi) equals 14c VHDL Signed 
Iteration 5. Sine of (d0 * pi) equals e6 VHDL Signed 
Iteration 6. Sine of (1 * pi) equals 10 VHDL Signed 
Iteration 7. Sine of (1db * pi) equals 15b VHDL Signed 
Iteration 8. Sine of (fe * pi) equals fe VHDL Signed 
Iteration 9. Sine of (99 * pi) equals c5 VHDL Signed 
APPENDIX G. SYNPLICITY AREA REPORT 


Below is an example of an Area Report generated by Synplicity during the 
synthesis process. The report shows which resources on the FPGA are used by each of 
the modules of the NFG circuit. This report is for sin(πx) where 0 < x <= 0.5 using 7 
segments. It is interesting to note that the slope and intercept look-up module is 
constructed entirely with LUTs and without memory. 


Other circuits, including the one for √x, were generated with 2855 segments. The slope 
and intercept look-up module was still constructed entirely with LUTs. 


##### START OF AREA REPORT #####[ 


Part: XC2V40CS144-6 (Xilinx) 


######## Utilization report for Top level view: signedfct ######## 


SEQUENTIAL ELEMENTS 
******************* 
Name          Total elements    Utilization    Notes 
REGISTERS     80                100 % 
LATCHES       0                 0.0 % 

Total SEQUENTIAL ELEMENTS in the block signedfct: 80 (39.02 % Utilization) 


COMBINATIONAL LOGIC 
******************* 
Name                    Total elements    Utilization    Notes 
LUTS                    50                100 % 
MUXCY                   15                100 % 
XORCY                   15                100 % 
MULT18x18/MULT18x18S    1                 100 % 

Total COMBINATIONAL LOGIC in the block signedfct: 81 (39.51 % Utilization) 


MEMORY ELEMENTS 
*************** 
Name          Total elements    Number of bits    Utilization    Notes 
SYNC RAMS     0                 0                 0.0 % 
ROMS          0                 0                 0.0 % 

Total MEMORY ELEMENTS in the block signedfct: 0 (0.00 % Utilization) 


IO PADS 
******* 
Name          Total elements    Utilization    Notes 

Total IO PADS in the block signedfct: 31 (15.12 % Utilization) 


######## Utilization report for cell: FD16CE_MXILINX_signedfct ######## 
Instance path: signedfct.FD16CE_MXILINX_signedfct 


SEQUENTIAL ELEMENTS 
******************* 
Name          Total elements    Utilization    Notes 
REGISTERS     16                20. % 
LATCHES       0                 0.0 % 

Total SEQUENTIAL ELEMENTS in the block 
signedfct.FD16CE_MXILINX_signedfct: 16 (7.80 % Utilization) 


COMBINATIONAL LOGIC 
******************* 
Name                    Total elements    Utilization    Notes 
LUTS                    0                 0.0 % 
MUXCY                   0                 0.0 % 
XORCY                   0                 0.0 % 
MULT18x18/MULT18x18S    0                 0.0 % 

Total COMBINATIONAL LOGIC in the block 
signedfct.FD16CE_MXILINX_signedfct: 0 (0.00 % Utilization) 


MEMORY ELEMENTS 
*************** 
Name          Total elements    Number of bits    Utilization    Notes 
SYNC RAMS     0                 0                 0.0 % 
ROMS          0                 0                 0.0 % 

Total MEMORY ELEMENTS in the block signedfct.FD16CE_MXILINX_signedfct: 
0 (0.00 % Utilization) 


IO PADS 
******* 
Name          Total elements    Utilization    Notes 

Total IO PADS in the block signedfct.FD16CE_MXILINX_signedfct: 0 (0.00 % 
Utilization) 


######## Utilization report for cell: FD16CE_MXILINX_signedfct_1 ######## 
Instance path: signedfct.FD16CE_MXILINX_signedfct_1 


SEQUENTIAL ELEMENTS 
******************* 
Name          Total elements    Utilization    Notes 
REGISTERS     16                20. % 
LATCHES       0                 0.0 % 

Total SEQUENTIAL ELEMENTS in the block 
signedfct.FD16CE_MXILINX_signedfct_1: 16 (7.80 % Utilization) 


COMBINATIONAL LOGIC 
******************* 
Name                    Total elements    Utilization    Notes 
LUTS                    0                 0.0 % 
MUXCY                   0                 0.0 % 
XORCY                   0                 0.0 % 
MULT18x18/MULT18x18S    0                 0.0 % 

Total COMBINATIONAL LOGIC in the block 
signedfct.FD16CE_MXILINX_signedfct_1: 0 (0.00 % Utilization) 


MEMORY ELEMENTS 
*************** 
Name          Total elements    Number of bits    Utilization    Notes 
SYNC RAMS     0                 0                 0.0 % 
ROMS          0                 0                 0.0 % 

Total MEMORY ELEMENTS in the block signedfct.FD16CE_MXILINX_signedfct_1: 
0 (0.00 % Utilization) 


IO PADS 
******* 
Name          Total elements    Utilization    Notes 

Total IO PADS in the block signedfct.FD16CE_MXILINX_signedfct_1: 0 (0.00 % 
Utilization) 


######## Utilization report for cell: FD16CE_MXILINX_signedfct_2 ######## 
Instance path: signedfct.FD16CE_MXILINX_signedfct_2 


SEQUENTIAL ELEMENTS 
******************* 
Name          Total elements    Utilization    Notes 
REGISTERS     16                20. % 
LATCHES       0                 0.0 % 

Total SEQUENTIAL ELEMENTS in the block 
signedfct.FD16CE_MXILINX_signedfct_2: 16 (7.80 % Utilization) 


COMBINATIONAL LOGIC 
******************* 
Name                    Total elements    Utilization    Notes 
LUTS                    0                 0.0 % 
MUXCY                   0                 0.0 % 
XORCY                   0                 0.0 % 
MULT18x18/MULT18x18S    0                 0.0 % 

Total COMBINATIONAL LOGIC in the block 
signedfct.FD16CE_MXILINX_signedfct_2: 0 (0.00 % Utilization) 


MEMORY ELEMENTS 
*************** 
Name          Total elements    Number of bits    Utilization    Notes 
SYNC RAMS     0                 0                 0.0 % 
ROMS          0                 0                 0.0 % 

Total MEMORY ELEMENTS in the block signedfct.FD16CE_MXILINX_signedfct_2: 
0 (0.00 % Utilization) 


IO PADS 
******* 
Name          Total elements    Utilization    Notes 

Total IO PADS in the block signedfct.FD16CE_MXILINX_signedfct_2: 0 (0.00 % 
Utilization) 


######## Utilization report for cell: FD16CE_MXILINX_signedfct_3 ######## 
Instance path: signedfct.FD16CE_MXILINX_signedfct_3 


SEQUENTIAL ELEMENTS 
******************* 
Name          Total elements    Utilization    Notes 
REGISTERS     16                20. % 
LATCHES       0                 0.0 % 

Total SEQUENTIAL ELEMENTS in the block 
signedfct.FD16CE_MXILINX_signedfct_3: 16 (7.80 % Utilization) 


COMBINATIONAL LOGIC 
******************* 
Name                    Total elements    Utilization    Notes 
LUTS                    0                 0.0 % 
MUXCY                   0                 0.0 % 
XORCY                   0                 0.0 % 
MULT18x18/MULT18x18S    0                 0.0 % 

Total COMBINATIONAL LOGIC in the block 
signedfct.FD16CE_MXILINX_signedfct_3: 0 (0.00 % Utilization) 


MEMORY ELEMENTS 
*************** 
Name          Total elements    Number of bits    Utilization    Notes 
SYNC RAMS     0                 0                 0.0 % 
ROMS          0                 0                 0.0 % 

Total MEMORY ELEMENTS in the block signedfct.FD16CE_MXILINX_signedfct_3: 
0 (0.00 % Utilization) 


IO PADS 
******* 
Name          Total elements    Utilization    Notes 
PADS          0                 0.0 % 

Total IO PADS in the block signedfct.FD16CE_MXILINX_signedfct_3: 0 (0.00 % 
Utilization) 


######## Utilization report for cell: FD16CE_MXILINX_signedfct_4 ######## 
Instance path: signedfct.FD16CE_MXILINX_signedfct_4 


SEQUENTIAL ELEMENTS 
******************* 
Name          Total elements    Utilization    Notes 
REGISTERS     16                20. % 
LATCHES       0                 0.0 % 

Total SEQUENTIAL ELEMENTS in the block 
signedfct.FD16CE_MXILINX_signedfct_4: 16 (7.80 % Utilization) 


COMBINATIONAL LOGIC 
******************* 
Name                    Total elements    Utilization    Notes 
LUTS                    0                 0.0 % 
MUXCY                   0                 0.0 % 
XORCY                   0                 0.0 % 
MULT18x18/MULT18x18S    0                 0.0 % 

Total COMBINATIONAL LOGIC in the block 
signedfct.FD16CE_MXILINX_signedfct_4: 0 (0.00 % Utilization) 


MEMORY ELEMENTS 
*************** 
Name          Total elements    Number of bits    Utilization    Notes 
SYNC RAMS     0                 0                 0.0 % 
ROMS          0                 0                 0.0 % 

Total MEMORY ELEMENTS in the block signedfct.FD16CE_MXILINX_signedfct_4: 
0 (0.00 % Utilization) 


IO PADS 
******* 
Name          Total elements    Utilization    Notes 

Total IO PADS in the block signedfct.FD16CE_MXILINX_signedfct_4: 0 (0.00 % 
Utilization) 


######## Utilization report for cell: adder16x16s ######## 
Instance path: signedfct.adder16x16s 


SEQUENTIAL ELEMENTS 
******************* 
Name          Total elements    Utilization    Notes 
REGISTERS     0                 0.0 % 
LATCHES       0                 0.0 % 

Total SEQUENTIAL ELEMENTS in the block signedfct.adder16x16s: 0 (0.00 % 
Utilization) 


COMBINATIONAL LOGIC 
******************* 
Name                    Total elements    Utilization    Notes 
LUTS                    16                32. % 
MUXCY                   15                100 % 
XORCY                   15                100 % 
MULT18x18/MULT18x18S    0                 0.0 % 

Total COMBINATIONAL LOGIC in the block signedfct.adder16x16s: 46 (22.44 % 
Utilization) 


MEMORY ELEMENTS 
*************** 
Name          Total elements    Number of bits    Utilization    Notes 
SYNC RAMS     0                 0                 0.0 % 
ROMS          0                 0                 0.0 % 

Total MEMORY ELEMENTS in the block signedfct.adder16x16s: 0 (0.00 % Utilization) 


IO PADS 
******* 
Name          Total elements    Utilization    Notes 








######## Utilization report for cell: mult16x16s ######## 
Instance path: signedfct.mult16x16s 


SEQUENTIAL ELEMENTS 
******************* 
Name          Total elements    Utilization    Notes 
REGISTERS     0                 0.0 % 
LATCHES       0                 0.0 % 

Total SEQUENTIAL ELEMENTS in the block signedfct.mult16x16s: 0 (0.00 % 
Utilization) 


COMBINATIONAL LOGIC 
******************* 
Name                    Total elements    Utilization    Notes 
LUTS                    0                 0.0 % 
MUXCY                   0                 0.0 % 
XORCY                   0                 0.0 % 
MULT18x18/MULT18x18S    1                 100 % 

Total COMBINATIONAL LOGIC in the block signedfct.mult16x16s: 1 (0.49 % 
Utilization) 


MEMORY ELEMENTS 
*************** 
Name          Total elements    Number of bits    Utilization    Notes 
SYNC RAMS     0                 0                 0.0 % 
ROMS          0                 0                 0.0 % 

Total MEMORY ELEMENTS in the block signedfct.mult16x16s: 0 (0.00 % Utilization) 


IO PADS 
******* 
Name          Total elements    Utilization    Notes 








######## Utilization report for cell: slopeintlu ######## 
Instance path: signedfct.slopeintlu 


SEQUENTIAL ELEMENTS 
******************* 
Name          Total elements    Utilization    Notes 
REGISTERS     0                 0.0 % 
LATCHES       0                 0.0 % 

Total SEQUENTIAL ELEMENTS in the block signedfct.slopeintlu: 0 (0.00 % 
Utilization) 


COMBINATIONAL LOGIC 
******************* 
Name                    Total elements    Utilization    Notes 
LUTS                    34                68. % 
MUXCY                   0                 0.0 % 
XORCY                   0                 0.0 % 
MULT18x18/MULT18x18S    0                 0.0 % 

Total COMBINATIONAL LOGIC in the block signedfct.slopeintlu: 34 (16.59 % 
Utilization) 


MEMORY ELEMENTS 
*************** 
Name          Total elements    Number of bits    Utilization    Notes 
SYNC RAMS     0                 0                 0.0 % 
ROMS          0                 0                 0.0 % 

Total MEMORY ELEMENTS in the block signedfct.slopeintlu: 0 (0.00 % Utilization) 


IO PADS 
******* 
Name          Total elements    Utilization    Notes 

Total IO PADS in the block signedfct.slopeintlu: 0 (0.00 % Utilization) 


##### END OF AREA REPORT #####] 